This invention relates to video cameras.

Video camera/recorder arrangements, including so-called camcorders, are commonly used in home and professional applications. Generally they store audio and video material on tape media, but other storage media such as optical or magnetic disk storage have been proposed.

Recently it has been proposed that professional camcorders might store some so-called "metadata" (additional data) along with the audio and video material that they capture. This metadata could be stored on the tape with the audio and video information, or it could be stored on a separate recording medium such as a flash memory card, or it could be transmitted by a wireless link to an external database. In any of these situations, a main purpose of the metadata is to assist users in making full use of the material later.

Some metadata is generated by a human operator (e.g. using a keyboard) and might define the location of filming; the actors / presenters; the date and time; the production staff; the type of camera; whether or not a current clip is considered to be a "good shot" by the cameraman etc. Another class of automated metadata may be generated by the camcorder and associated equipment, for example defining the focus, zoom and aperture settings of the camera lens, the geographical position (via a GPS receiver), the camera's maintenance schedule and so on.

While this latter class of metadata is useful to an extent, when a user later needs to locate a particular video clip from a large group of archived video clips, the more useful metadata is the first class, that generated by a human operator. For example, the later user is far more likely to search for a clip containing a particular celebrity than a clip in which a Fuji lens was used at an aperture f1.8. However, although the human-generated metadata is often the more useful, it is very time-consuming (and therefore expensive) for someone to enter all of the required data at or soon after capture of the material.

This invention provides a video camera arrangement comprising:
an image capture device;
a face detector for detecting human faces in the captured video material and for generating face data identifying detected occurrences of faces in the captured video material;
a data handling medium by which data representing the captured images is transmitted and/or stored; and
a processor for generating data to be transmitted or stored by the data handling medium in dependence on the detection of faces in the captured images.

The invention addresses the above problems by, in at least some embodiments, providing a new class of machine-generated metadata, namely face images, which can be stored along with the captured video material, but which [...] user trying to establish quickly the content of the video material) much more on a level with human-generated metadata. In other embodiments the nature of image signals communicated from the camera arrangement to a remote node can be altered in dependence on the face detection, for example to achieve a reduction in or a better use of available transmission bandwidth.

The occurrences of faces may be treated as simply an identifi[...] a field or frame) but preferably also include a position [...]

The camera arrangement is preferably, though not necessarily, a unitary arrangement otherwise known as a camcorder.

Further respective aspects and features of the invention are defined in the appended claims.

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, throughout which like parts are defined by like numerals, and in which:

Figure 1 is a schematic diagram of a general purpose computer system for use as a face detection system and/or a non-linear editing system;
Figure 2 is a schematic diagram of a video camera-recorder (camcorder) using face detection;
Figure 3 is a schematic diagram illustrating a training phase;
Figure 4 is a schematic diagram illustrating a detection phase;
Figure 5 schematically illustrates a feature histogram;
Figure 6 schematically illustrates a sampling process to generate eigenblocks;
Figures 7 and 8 schematically illustrate sets of eigenblocks;
Figure 9 schematically illustrates a process to build a histogram representing a block position;
Figure 10 schematically illustrates the generation of a histogram bin number;
Figure 11 schematically illustrates the calculation of a face probability;
Figures 12a to 12f are schematic examples of histograms generated using the above methods;
Figures 13a to 13g schematically illustrate so-called multiscale face detection;
Figure 14 schematically illustrates a face tracking algorithm;
Figures 15a and 15b schematically illustrate the derivation of a search area used for skin colour detection;
Figure 16 schematically illustrates a mask applied to skin colour detection;
Figures 17a to 17c schematically illustrate the use of the mask of Figure 16;
Figure 18 is a schematic distance map;
Figures 19a to 19c schematically illustrate the use of face tracking when applied to a video scene;
Figure 20 schematically illustrates a display screen of a non-linear editing system;
Figures 21a and 21b schematically illustrate clip icons;
Figures 22a to 22c schematically illustrate a gradient pre-processing technique;
Figure 23 schematically illustrates a video conferencing system;
Figures 24 and 25 schematically illustrate a video conferencing system in greater detail;
Figure 26 is a flowchart schematically illustrating one mode of operation of the system of Figures 23 to 25;
Figures 27a and 27b are example images relating to the flowchart of Figure 26;
Figure 28 is a flowchart schematically illustrating another mode of operation of the system of Figures 23 to 25;
Figures 29 and 30 are example images relating to the flowchart of Figure 28;
Figure 31 is a flowchart schematically illustrating another mode of operation of the system of Figures 23 to 25;
Figure 32 is an example image relating to the flowchart of Figure 31; and
Figures 33 and 34 are flowcharts schematically illustrating further modes of operation of the system of Figures 23 to 25.

Figure 1 is a schematic diagram of a general purpose computer system for use as a face detection system and/or a non-linear editing system. The computer system comprises a processing unit 10 having (amongst other conventional components) a central processing unit (CPU) 20, memory such as a random access memory (RAM) 30 and non-volatile storage such as a disc drive 40. The computer system may be connected to a network 50 such as a local area network or the Internet (or both). A keyboard 60, mouse or other user input device 70 and display screen 80 are also provided. The skilled man will appreciate that a general purpose computer system may include many other conventional parts which need not be described here.

Figure 2 is a schematic diagram of a video camera-recorder (camcorder) using face detection. The camcorder 100 comprises a lens 110 which focuses an image onto a charge coupled device (CCD) image capture device 120. The resulting image in electronic form is processed by image processing logic 130 for recording on a recording medium such as a tape cassette 140. The images captured by the device 120 are also displayed on a user display 150 which may be viewed through an eyepiece 160.

To capture sounds associated with the images, one or more microphones are used. These may be external microphones, in the sense that they are connected to the camcorder by a flexible cable, or may be mounted on the camcorder body itself. Analogue audio signals from the microphone(s) are processed by an audio processing arrangement 170 to produce appropriate audio signals for recording on the storage medium 140.

It is noted that the video and audio signals may be recorded on the storage medium 140 in either digital form or analogue form, or even in both forms. Thus, the image processing arrangement 130 and the audio processing arrangement 170 may include a stage of analogue to digital conversion.

The camcorder user is able to control aspects of the lens 110's performance by user controls 180 which influence a lens control arrangement 190 to send electrical control signals 200 to the lens 110. Typically, attributes such as focus and zoom are controlled in this way, but the lens aperture or other attributes may also be controlled by the user.

Two further user controls are schematically illustrated. A push button 210 is provided to initiate and stop recording onto the recording medium 140. For example, one push of the control 210 may start recording and another push may stop recording, or the control may need to be held in a pushed state for recording to take place, or one push may start recording for a certain timed period, for example five seconds. In any of these arrangements, it is technologically very straightforward to establish from the camcorder's record operation where the beginning and end of each "shot" (continuous period of recording) occurs.

The other user control shown schematically in Figure 2 is a "good shot marker" (GSM) 220, which may be operated by the user to cause "metadata" (associated data) to be stored in connection with the video and audio material on the recording medium 140, indicating that this particular shot was subjectively considered by the operator to be "good" in some respect (for example, the actors performed particularly well; the news reporter pronounced each word correctly; and so on).

The metadata may be recorded in some spare capacity (e.g. "user data") on the recording medium 140, depending on the particular format and standard in use. Alternatively, the metadata can be stored on a separate storage medium such as a removable MemoryStick™ memory (not shown), or the metadata could be stored on an external database (not shown), for example being communicated to such a database by a wireless link (not shown). The metadata can include not only the GSM information but also shot boundaries, lens attributes, alphanumeric information input by a user (e.g. on a keyboard, not shown), geographical position information from a global positioning system receiver (not shown) and so on.

So far, the description has covered a metadata-enabled camcorder. Now, the way in which face detection may be applied to such a camcorder will be described.

The camcorder includes a face detector arrangement 230. Appropriate arrangements will be described in much greater detail below, but for this part of the description it is sufficient to say that the face detector arrangement 230 receives images from the image processing arrangement 130 and detects, or attempts to detect, whether such images contain one or more faces. The face detector may output face detection data which could be in the form of a "yes/no" flag or may be more detailed in that the data could include the image co-ordinates of the faces, such as the co-ordinates of eye positions within each detected face. This information may be treated as another type of metadata and stored in any of the other formats described above.

As described below, face detection may be assisted by using other types of metadata within the detection process. For example, the face detector 230 receives a control signal from the lens control arrangement 190 to indicate the current focus and zoom settings of the lens 110. These can assist the face detector by giving an initial indication of the expected image size of any faces that may be present in the foreground of the image. In this regard, it is noted that the focus and zoom settings between them define the expected separation between the camcorder 100 and a person being filmed, and also the magnification of the lens 110. From these two attributes, based upon an average face size, it is possible to calculate the expected size (in pixels) of a face in the resulting image data.
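The expected-size calculation can be sketched with a simple pinhole-lens approximation. This is only an illustration under stated assumptions: the function name, the 0.16 m average face width, and the use of focal length and sensor pixel pitch to stand in for the zoom setting and magnification are hypothetical and are not taken from the description.

```python
def expected_face_width_px(focus_distance_m, focal_length_mm,
                           pixel_pitch_um, face_width_m=0.16):
    """Pinhole estimate of how many pixels wide a face appears on the sensor.

    focus_distance_m: assumed camera-to-subject separation (from the focus setting)
    focal_length_mm:  assumed lens focal length (from the zoom setting)
    pixel_pitch_um:   assumed sensor pixel pitch
    face_width_m:     assumed average face width (hypothetical default)
    """
    if focus_distance_m <= 0 or pixel_pitch_um <= 0:
        raise ValueError("distance and pixel pitch must be positive")
    # image size = object size * focal length / distance (valid when the
    # subject is much further away than the focal length)
    image_width_mm = (face_width_m * 1000.0) * focal_length_mm / (focus_distance_m * 1000.0)
    return image_width_mm / (pixel_pitch_um / 1000.0)


# A subject 2 m away, 20 mm focal length, 10 micron pixels.
print(expected_face_width_px(2.0, 20.0, 10.0))
```

Doubling the subject distance halves the estimate, which is the kind of prior a detector could use to decide which image scales to examine first.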

A conventional (known) speech detector 240 receives audio information from the audio processing arrangement 170 and detects the presence of speech in such audio information. The presence of speech may be an indicator that the likelihood of a face being present in the corresponding images is higher than if no speech is detected.

Finally, the GSM information 220 and shot information (from the control 210) are supplied to the face detector 230, to indicate shot boundaries and those shots considered to be most useful by the user.

Of course, if the camcorder is based upon the analogue recording technique, further analogue to digital converters (ADCs) may be required to handle the image and audio information.

The present embodiment uses a face detection technique arranged as two phases. Figure 3 is a schematic diagram illustrating a training phase, and Figure 4 is a schematic diagram illustrating a detection phase.

Unlike some previously proposed face detection methods (see References 4 and 5 below), the present method is based on modelling the face in parts instead of as a whole. The parts can either be blocks centred over the assumed positions of the facial features (so-called "selective sampling") or blocks sampled at regular intervals over the face (so-called "regular sampling"). The present description will cover primarily regular sampling, as this was found in empirical tests to give the better results.

In the training phase, an analysis process is applied to a set of images known to contain faces, and (optionally) another set of images ("nonface images") known not to contain faces. The analysis process builds a mathematical model of facial and nonfacial features, against which a test image can later be compared (in the detection phase).

So, to build the mathematical model (the training process 310 of Figure 3), the basic steps are as follows:

1. From a set 300 of face images normalised to have the same eye positions, each face is sampled regularly into small blocks.

2. Attributes are calculated for each block; these attributes are explained further below.

3. The attributes are quantised to a manageable number of different values.

4. The quantised attributes are then combined to generate a single quantised value in respect of that block position.

5. The single quantised value is then recorded as an entry in a histogram, such as the schematic histogram of Figure 5. The collective histogram information 320 in respect of all of the block positions in all of the training images forms the foundation of the mathematical model of the facial features.
One such histogram is prepared for each possible block position, by repeating the above steps in respect of a large number of test face images. The test data are described further in Appendix A below. So, in a system which uses an array of 8 x 8 blocks, 64 histograms are prepared. In a later part of the processing, a test quantised attribute is compared with the histogram data; the fact that a whole histogram is used to model the data means that no assumptions have to be made about whether it follows a parameterised distribution, e.g. Gaussian or otherwise. To save data storage space (if needed), histograms which are similar can be merged so that the same histogram can be reused for different block positions.

In the detection phase, to apply the face detector to a test image 350, successive windows in the test image are processed 340 as follows:

6. The window is sampled regularly as a series of blocks, and attributes in respect of each block are calculated and quantised as in stages 1-4 above.

7. Corresponding "probabilities" for the quantised attribute values for each block position are looked up from the corresponding histograms. That is to say, for each block position, a respective quantised attribute is generated and is compared with a histogram previously generated in respect of that block position. The way in which the histograms give rise to "probability" data will be described below.

8. All the probabilities obtained above are multiplied together to form a final probability which is compared against a threshold in order to classify the window as "face" or "nonface". It will be appreciated that the detection result of "face" or "nonface" is a probability-based measure rather than an absolute detection. Sometimes, an image not containing a face may be wrongly detected as "face", a so-called false positive. At other times, an image containing a face may be wrongly detected as "nonface", a so-called false negative. It is an aim of any face detection system to reduce the proportion of false positives and the proportion of false negatives, but it is of course understood that to reduce these proportions to zero is difficult, if not impossible, with current technology.

As mentioned above, in the training phase, a set of "nonface" images can be used to generate a corresponding set of "nonface" histograms. Then, to achieve detection of a face, the "probability" produced from the nonface histograms may be compared with a separate threshold, so that the probability has to be under the threshold for the test window to contain a face. Alternatively, the ratio of the face probability to the nonface probability could be compared with a threshold.
Extra training data may be generated by applying "synthetic variations" 330 to the original training set, such as variations in position, orientation, size, aspect ratio, background scenery, lighting intensity and frequency content.

The derivation of attributes and their quantisation will now be described. In the present technique, attributes are measured with respect to so-called eigenblocks, which are core blocks (or eigenvectors) representing different types of block which may be present in the windowed image. The generation of eigenblocks will first be described with reference to Figure 6.

Eigenblock creation

The attributes in the present embodiment are based on so-called eigenblocks. The eigenblocks were designed to have good representational ability of the blocks in the training set. Therefore, they were created by performing principal component analysis on a large set of blocks from the training set. This process is shown schematically in Figure 6 and described in more detail in Appendix B.
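As a rough illustration of principal component analysis applied to flattened blocks, the following sketch finds the leading principal directions by power iteration with deflation. It is a generic textbook procedure chosen for self-containment, not the procedure of Appendix B; the function name and parameters are hypothetical, and a linear-algebra library would normally be used for 256-element (16x16) blocks.

```python
import math
import random


def principal_components(vectors, k, iterations=200, seed=0):
    """Return (mean, components): the top-k principal directions of `vectors`."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centred = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # d x d covariance matrix of the mean-centred, flattened blocks.
    cov = [[sum(c[a] * c[b] for c in centred) / n for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(iterations):
            nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # no variance left in any direction
                break
            vec = [x / norm for x in nxt]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        # Variance along this direction, then remove it (deflation) so the
        # next pass converges to the next principal direction.
        cov_vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        eigenvalue = sum(vec[a] * cov_vec[a] for a in range(d))
        for a in range(d):
            for b in range(d):
                cov[a][b] -= eigenvalue * vec[a] * vec[b]
        components.append(vec)
    return mean, components


# Toy "blocks" of two pixels each, varying mostly along the (1, 1) direction.
blocks = [(2.0, 2.0), (-2.0, -2.0), (1.0, 1.0), (-1.0, -1.0), (0.1, -0.1), (-0.1, 0.1)]
mean, comps = principal_components(blocks, 2)
```

Each returned component plays the role of one eigenblock: a unit-length direction capturing as much of the remaining variation in the training blocks as possible.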

Training the System 

Experiments were performed with two different sets of training blocks.

Eigenblock set I

Initially, a set of blocks was used that were taken from 25 face images in the training set. The 16x16 blocks were sampled every 16 pixels and so were non-overlapping. This sampling is shown in Figure 6. As can be seen, 16 blocks are generated from each 64x64 training image. This leads to a total of 400 training blocks overall.

The first 10 eigenblocks generated from these training blocks are shown in Figure 7.

Eigenblock set II

A second set of eigenblocks was generated from a much larger set of training blocks. These blocks were taken from 500 face images in the training set. In this case, the 16x16 blocks were sampled every 8 pixels and so overlapped by 8 pixels. This generated 49 blocks from each 64x64 training image and led to a total of 24,500 training blocks.

The first 12 eigenblocks generated from these training blocks are shown in Figure 8.

Empirical results show that eigenblock set II gives slightly better results than set I. This is because it is calculated from a larger set of training blocks taken from face images, and so is perceived to be better at representing the variations in faces. However, the improvement in performance is not large.
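The block counts quoted for the two eigenblock sets follow directly from the image size, block size and sampling step. A small sketch (the helper name is hypothetical) reproduces the arithmetic:

```python
def block_origins(image_size, block_size, step):
    """Top-left corners of all square blocks that fit inside a square image."""
    last = image_size - block_size
    return [(row, col)
            for row in range(0, last + 1, step)
            for col in range(0, last + 1, step)]


set_one = block_origins(64, 16, 16)   # non-overlapping blocks
set_two = block_origins(64, 16, 8)    # blocks overlapping by 8 pixels

print(len(set_one), 25 * len(set_one))    # 16 blocks per image, 400 in total
print(len(set_two), 500 * len(set_two))   # 49 blocks per image, 24500 in total
```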

Building the Histograms

A histogram was built for each sampled block position within the 64x64 face image. The number of histograms depends on the block spacing. For example, for block spacing of 16 pixels, there are 16 possible block positions and thus 16 histograms are used.

The process used to build a histogram representing a single block position is shown in Figure 9. The histograms are created using a large training set 400 of M face images. For each face image, the process comprises:

• Extracting 410 the relevant block from a position (i,j) in the face image.

• Calculating the eigenblock-based attributes for the block, and determining the relevant bin number 420 from these attributes.

• Incrementing the relevant bin number in the histogram 430.

This process is repeated for each of M images in the training set, to create a histogram that gives a good representation of the distribution of frequency of occurrence of the attributes. Ideally, M is very large, e.g. several thousand. This can more easily be achieved by using a training set made up of a set of original faces and several hundred synthetic variations of each original face.
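The extract, bin and increment loop can be sketched as follows. The function names and the toy data are illustrative assumptions; in particular the bin function passed in here simply sums the pixels, standing in for the eigenblock-based bin number.

```python
from collections import Counter


def extract_block(image, position, size):
    """Cut a size x size block whose top-left corner is at (i, j)."""
    i, j = position
    return [row[j:j + size] for row in image[i:i + size]]


def build_histograms(images, positions, size, bin_number):
    """Build one histogram (bin number -> count) per block position."""
    histograms = {position: Counter() for position in positions}
    for image in images:
        for position in positions:
            block = extract_block(image, position, size)
            histograms[position][bin_number(block)] += 1
    return histograms


# Toy data: three 4x4 "images" sampled as 2x2 blocks (M = 3).
images = [[[1] * 4 for _ in range(4)],
          [[0] * 4 for _ in range(4)],
          [[1] * 4 for _ in range(4)]]
positions = [(0, 0), (0, 2), (2, 0), (2, 2)]
histograms = build_histograms(images, positions, 2,
                              lambda block: sum(map(sum, block)))
```

Every histogram ends up with exactly M entries in total, one per training image, which is why the counts can later be treated as unnormalised probabilities.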

Generating the histogram bin number

A histogram bin number is generated from a given block using the following process, as shown in Figure 10. The 16x16 block 440 is extracted from the 64x64 window or face image. The block is projected onto the set 450 of A eigenblocks to generate a set of "eigenblock weights". These eigenblock weights are the "attributes" used in this implementation. They have a range of -1 to +1. This process is described in more detail in Appendix B. Each weight is quantised into a fixed number of levels, L, to produce a set of quantised attributes 470, $w_i$, $i = 1..A$. The quantised weights are combined into a single value as follows:

$$h = w_1 L^{A-1} + w_2 L^{A-2} + w_3 L^{A-3} + \dots + w_{A-1} L^{1} + w_A L^{0}$$

where the value generated, h, is the histogram bin number 480. Note that the total number of bins in the histogram is given by $L^A$.
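The combination of quantised weights into a bin number amounts to reading the weights as the digits of a base-L number. In the sketch below the uniform quantiser is an assumption (the description does not say how the range -1 to +1 is divided), and both function names are hypothetical.

```python
def quantise_weight(weight, levels):
    """Map a weight in [-1, +1] to an integer level 0..levels-1 (uniform steps assumed)."""
    clipped = max(-1.0, min(1.0, weight))
    level = int((clipped + 1.0) / 2.0 * levels)
    return min(level, levels - 1)  # keep +1.0 inside the top level


def bin_number(weights, levels):
    """Combine A quantised weights as h = w_1*L^(A-1) + ... + w_A*L^0."""
    h = 0
    for weight in weights:
        h = h * levels + quantise_weight(weight, levels)  # Horner's rule
    return h


print(bin_number([-1.0, 0.0, 0.99], levels=4))   # digits 0, 2, 3 -> 11
print(bin_number([1.0, 1.0, 1.0], levels=4))     # largest bin: 4**3 - 1 = 63
```

With A weights and L levels the bin numbers run from 0 to L^A - 1, so the histogram size grows quickly with either parameter.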
The bin "contents", i.e. the frequency of occurrence of the set of attributes giving rise to that bin number, may be considered to be a probability value if it is divided by the number of training images M. However, because the probabilities are compared with a threshold, there is in fact no need to divide through by M as this value would cancel out in the calculations. So, in the following discussions, the bin "contents" will be referred to as "probability values", and treated as though they are probability values, even though in a strict sense they are in fact frequencies of occurrence.

The above process is used both in the training phase and in the detection phase.

Face Detection Phase

The face detection process involves sampling the test image with a moving 64x64 window and calculating a face probability at each window position.

The calculation of the face probability is shown in Figure 11. For each block position in the window, the block's bin number 490 is calculated as described in the previous section. Using the appropriate histogram 500 for the position of the block, each bin number is looked up and the probability 510 of that bin number is determined. The sum 520 of the logs of these probabilities is then calculated across all the blocks to generate a face probability value, $P_{face}$ (otherwise referred to as a log likelihood value).
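The per-window calculation can be sketched as a sum of log bin counts. The floor applied to empty bins is an assumption made here so that the logarithm is defined; the description does not say how empty bins are handled, and the function name is hypothetical.

```python
import math


def window_log_likelihood(bin_numbers, histograms, floor=1e-6):
    """Sum of log bin contents over all block positions in one window.

    bin_numbers: {block position: bin number computed for the test window}
    histograms:  {block position: {bin number: training count}}
    floor:       assumed stand-in count for bins never seen in training
    """
    total = 0.0
    for position, h in bin_numbers.items():
        count = histograms[position].get(h, 0)
        total += math.log(max(count, floor))
    return total


histograms = {(0, 0): {5: 10}, (0, 1): {7: 100}}
seen = window_log_likelihood({(0, 0): 5, (0, 1): 7}, histograms)
unseen = window_log_likelihood({(0, 0): 9, (0, 1): 7}, histograms)
print(seen > unseen)  # a window matching the training data scores higher
```

Summing logs is equivalent to multiplying the probabilities, but avoids numerical underflow when many block positions are combined.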

This process generates a probability "map" for the entire test image. In other words, a probability value is derived in respect of each possible window centre position across the image. The combination of all of these probability values into a rectangular (or whatever) shaped array is then considered to be a probability "map" corresponding to that image.

This map is then inverted, so that the process of finding a face involves finding minima in the inverted map. A so-called distance-based technique is used. This technique can be summarised as follows: The map (pixel) position with the smallest value in the inverted probability map is chosen. If this value is larger than a threshold (TD), no more faces are chosen. This is the termination criterion. Otherwise a face-sized block corresponding to the chosen centre pixel position is blanked out (i.e. omitted from the following calculations) and the candidate face position finding procedure is repeated on the rest of the image until the termination criterion is reached.
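The distance-based selection loop can be sketched as repeated minimum-finding with blanking. The function name, the square blanking region and the list-of-lists map are illustrative assumptions.

```python
def find_faces(inverted_map, threshold, face_size):
    """Pick minima of the inverted map until the smallest value exceeds the threshold."""
    rows, cols = len(inverted_map), len(inverted_map[0])
    blanked = [[False] * cols for _ in range(rows)]
    half = face_size // 2
    faces = []
    while True:
        best = None
        for r in range(rows):
            for c in range(cols):
                if blanked[r][c]:
                    continue
                if best is None or inverted_map[r][c] < inverted_map[best[0]][best[1]]:
                    best = (r, c)
        # Termination criterion: nothing left, or the best value is above TD.
        if best is None or inverted_map[best[0]][best[1]] > threshold:
            break
        faces.append(best)
        # Blank out a face-sized block centred on the chosen position.
        for r in range(max(0, best[0] - half), min(rows, best[0] + half + 1)):
            for c in range(max(0, best[1] - half), min(cols, best[1] + half + 1)):
                blanked[r][c] = True
    return faces


toy_map = [[9] * 5 for _ in range(5)]
toy_map[1][1], toy_map[1][2], toy_map[3][4] = 1, 2, 3
print(find_faces(toy_map, threshold=5, face_size=3))
```

In the toy map the value 2 next to the first minimum is suppressed by the blanking step, so it is not reported as a second, overlapping face.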
Nonface method

The nonface model comprises an additional set of histograms which represent the probability distribution of attributes in nonface images. The histograms are created in exactly the same way as for the face model, except that the training images contain examples of nonfaces instead of faces.

During detection, two log probability values are computed, one using the face model and one using the nonface model. These are then combined by simply subtracting the nonface probability from the face probability:

$$P_{combined} = P_{face} - P_{nonface}$$

$P_{combined}$ is then used instead of $P_{face}$ to produce the probability map (before inversion).

Note that the reason that $P_{nonface}$ is subtracted from $P_{face}$ is because these are log probability values.
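Because both values are logarithms, the subtraction is the same as taking the log of the ratio of the face probability to the nonface probability. A brief check, with made-up probability values and a hypothetical function name:

```python
import math


def combined_log_probability(log_p_face, log_p_nonface):
    """Log-domain combination of the face and nonface model outputs."""
    return log_p_face - log_p_nonface


p_face, p_nonface = 0.2, 0.05   # illustrative values only
combined = combined_log_probability(math.log(p_face), math.log(p_nonface))
print(math.isclose(combined, math.log(p_face / p_nonface)))
```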






Histogram Examples 

Figures 12a to 12f show some examples of histograms generated by the training 
process described above. 

Figures 12a, 12b and 12c are derived from a training set of face images, and Figures 12d, 12e and 12f are derived from a training set of nonface images. In particular:

                                               Face histograms    Nonface histograms
Whole histogram                                Figure 12a         Figure 12d
Zoomed onto the main peaks at about h=1500     Figure 12b         Figure 12e
A further zoom onto the region about h=1570    Figure 12c         Figure 12f

It can clearly be seen that the peaks are in different places in the face histograms and the nonface histograms.

Multiscale face detection 

In order to detect faces of different sizes in the test image, the test image is scaled by a range of factors and a distance (i.e. probability) map is produced for each scale. In Figures 13a to 13c the images and their corresponding distance maps are shown at three different scales. The method gives the best response (highest probability, or minimum distance) for the large (central) subject at the smallest scale (Fig 13a) and better responses for the smaller subject (to the left of the main figure) at the larger scales. (A darker colour on the map represents a lower value in the inverted map, or in other words a higher probability of there being a face).

Candidate face positions are extracted across different scales by first finding the position which gives the best response over all scales. That is to say, the highest probability (lowest distance) is established amongst all of the probability maps at all of the scales. This candidate position is the first to be labelled as a face. The window centred over that face position is then blanked out from the probability map at each scale. The size of the window blanked out is proportional to the scale of the probability map.

Examples of this scaled blanking-out process are shown in Figures 13a to 13c. In particular, the highest probability across all the maps is found at the left hand side of the largest scale map (Figure 13c). An area 530 corresponding to the presumed size of a face is blanked off in Figure 13c. Corresponding, but scaled, areas 532, 534 are blanked off in the smaller maps.

Areas larger than the test window may be blanked off in the maps, to avoid overlapping detections. In particular, an area equal to the size of the test window surrounded by a border half as wide/long as the test window is appropriate to avoid such overlapping detections.

Additional faces are detected by searching for the next best response and blanking out the corresponding windows successively.

The intervals allowed between the scales processed are influenced by the sensitivity of the method to variations in size. It was found in this preliminary study of scale invariance that the method is not excessively sensitive to variations in size, as faces which gave a good response at a certain scale often gave a good response [...]

The above description refers to detecting a face even though the size of the face in the image is not known at the start of the detection process. Another aspect of multiple scale face detection is the use of two or more parallel detections at different scales to validate the detection process. This can have advantages if, for example, the face to be detected is partially obscured, or the person is wearing a hat etc.

Figures 13d to 13g schematically illustrate this process. During the training phase, the system is trained on windows (divided into respective blocks as described above) which surround the whole of the test face (Figure 13d) to generate "full face" histogram data and also on windows at an expanded scale so that only a central area of the test face is included (Figure 13e) to generate "zoomed in" histogram data. This generates two sets of histogram data. One set relates to the "full face" windows of Figure 13d, and the other relates to the "central face area" windows of Figure 13e.

During the detection phase, for any given test window 536, the window is applied to two different scalings of the test image so that in one (Figure 13f) the test window surrounds the whole of the expected size of a face, and in the other (Figure 13g) the test window encompasses the central area of a face at that expected size. These are each processed as described above, being compared with the respective sets of histogram data appropriate to the type of window. The log probabilities from each parallel process are added before the comparison with a threshold is applied.

Putting both of these aspects of multiple scale face detection together leads to a particularly elegant saving in the amount of data that needs to be stored.

In particular, in these embodiments the multiple scales for the arrangements of Figures 13a to 13c are arranged in a geometric sequence. In the present example, each scale in the sequence is a factor of $\sqrt[4]{2}$ different to the adjacent scale in the sequence. Then, for the parallel detection described with reference to Figures 13d to 13g, the larger scale, central area, detection is carried out at a scale 3 steps higher in the sequence, that is, $2^{3/4}$ times larger than the "full face" scale, using attribute data relating to the scale 3 steps higher in the sequence. So, apart from at extremes of the range of multiple scales, the geometric progression means that the parallel detection of Figures 13d to 13g can always be carried out using attribute data generated in respect of another multiple scale three steps higher in the sequence.
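The reuse of attribute data across a geometric sequence of scales can be sketched as follows. The adjacent-scale ratio is passed in as a parameter; the fourth root of 2 used in the example is an assumption for illustration, as are the function names and the number of scales.

```python
def scale_sequence(count, ratio, base=1.0):
    """Geometric sequence of scale factors: base, base*ratio, base*ratio^2, ..."""
    return [base * ratio ** n for n in range(count)]


def zoomed_in_index(n, count, steps=3):
    """Index of the scale whose attribute data serves the central-area test for scale n.

    Returns None at the upper extreme of the range, where no such scale exists.
    """
    m = n + steps
    return m if m < count else None


scales = scale_sequence(8, 2 ** 0.25)   # assumed ratio: fourth root of 2
print(scales[5] / scales[2])            # three steps apart: about 1.68
print(zoomed_in_index(2, 8), zoomed_in_index(6, 8))
```

Because the ratio between scales n and n+3 is the same for every n, attributes computed once per scale can serve both the "full face" test at that scale and the "zoomed in" test for the scale three steps lower.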

The two processes (multiple scale detection and parallel scale detection) can be combined in various ways. For example, the multiple scale detection process of Figures 13a to 13c can be applied first, and then the parallel scale detection process of Figures 13d to 13g can be applied at areas (and scales) identified during the multiple scale detection process. However, a convenient and efficient use of the attribute data may be achieved by:

• deriving attributes in respect of the test window at each scale (as in Figures 13a to 13c)

• comparing those attributes with the "full face" histogram data to generate a "full face" set of distance maps

• comparing the attributes with the "zoomed in" histogram data to generate a "zoomed in" set of distance maps






• for each scale n, combining the "full face" distance map for scale n with the "zoomed in"
distance map for scale n+3

• deriving face positions from the combined distance maps as described above with
reference to Figures 13a to 13c
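By way of illustration only, the pairing of scale n with scale n+3 may be sketched as follows. The function names, the use of NumPy arrays, the resampling of all maps to a common shape and the choice of addition as the combining operation (following the statement above that log probabilities are added) are assumptions of this sketch rather than features taken from the described embodiment.

```python
import numpy as np

STEP = 2 ** 0.25  # assumed ratio between adjacent scales (three steps give 2**0.75)


def scale_sequence(base, count):
    """Geometric sequence of `count` scales starting at `base`."""
    return [base * STEP ** n for n in range(count)]


def combine_maps(full_face_maps, zoomed_in_maps, offset=3):
    """Add the full-face map for scale n to the zoomed-in map for scale n + offset.

    Each list holds one map per scale, all assumed to share a common shape.
    Scales at the top of the range, which have no partner `offset` steps
    higher, are skipped.
    """
    usable = min(len(full_face_maps), len(zoomed_in_maps) - offset)
    return {n: full_face_maps[n] + zoomed_in_maps[n + offset] for n in range(usable)}
```

Because only one set of attributes is derived per scale, the same attribute data serves both the "full face" comparison at scale n and the "zoomed in" comparison for the face three steps lower in the sequence.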

Further parallel testing can be performed to detect different poses, such as looking
straight ahead, looking partly up, down, left, right etc. Here a respective set of histogram
data is required and the results are preferably combined using a "max" function, that is, the
pose giving the highest probability is carried forward to thresholding, the others being
discarded.
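The "max" combination may be sketched as below. This is a minimal illustration assuming one probability map per pose, all of the same shape; the pose names and the function name are hypothetical.

```python
import numpy as np


def combine_poses(pose_maps):
    """Combine per-pose probability maps with a position-wise maximum.

    pose_maps maps a pose name to a probability map; all maps share a shape.
    Returns the maximum map and, for each position, the name of the winning pose.
    """
    names = list(pose_maps)
    stack = np.stack([pose_maps[name] for name in names])
    winner = stack.argmax(axis=0)
    return stack.max(axis=0), np.array(names)[winner]
```

Only the returned maximum map would then be compared with the threshold.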



Face Tracking 

A face tracking algorithm will now be described. The tracking algorithm aims to
improve face detection performance in image sequences.

The initial aim of the tracking algorithm is to detect every face in every frame of an
image sequence. However, it is recognised that sometimes a face in the sequence may not be
detected. In these circumstances, the tracking algorithm may assist in interpolating across
the missing face detections.

Ultimately, the goal of face tracking is to be able to output some useful metadata
from each set of frames belonging to the same scene in an image sequence. This might
include:

• Number of faces.

• "Mugshot" (a colloquial word for an image of a person's face, derived from a term
referring to a police file photograph) of each face.

• Frame number at which each face first appears.

• Frame number at which each face last appears.

• Identity of each face (either matched to faces seen in previous scenes, or matched to a
face database); this requires some face recognition also.

The tracking algorithm uses the results of the face detection algorithm, run
independently on each frame of the image sequence, as its starting point. Because the face
detection algorithm may sometimes miss (not detect) faces, some method of interpolating the
missing faces is useful. To this end, a Kalman filter was used to predict the next position of
the face and a skin colour matching algorithm was used to aid tracking of faces. In addition,
because the face detection algorithm often gives rise to false acceptances, some method of
rejecting these is also useful.

The algorithm is shown schematically in Figure 14. 

The algorithm will be described in detail below, but in summary, input video data
545 (representing the image sequence) is supplied to a face detector of the type described in
this application, and a skin colour matching detector 550. The face detector attempts to
detect one or more faces in each image. When a face is detected, a Kalman filter 560 is
established to track the position of that face. The Kalman filter generates a predicted
position for the same face in the next image in the sequence. An eye position comparator
570, 580 detects whether the face detector 540 detects a face at that position (or within a
certain threshold distance of that position) in the next image. If this is found to be the case,
then that detected face position is used to update the Kalman filter and the process continues.

If a face is not detected at or near the predicted position, then a skin colour matching
method 550 is used. This is a less precise face detection technique which is set up to have a
lower threshold of acceptance than the face detector 540, so that it is possible for the skin
colour matching technique to detect (what it considers to be) a face even when the face
detector cannot make a positive detection at that position. If a "face" is detected by skin
colour matching, its position is passed to the Kalman filter as an updated position and the
process continues.

If no match is found by either the face detector 540 or the skin colour detector 550,
then the predicted position is used to update the Kalman filter.

All of these results are subject to acceptance criteria (see below). So, for example, a
face that is tracked throughout a sequence on the basis of one positive detection and the
remainder as predictions, or the remainder as skin colour detections, will be rejected.

A separate Kalman filter is used to track each face in the tracking algorithm.

In order to use a Kalman filter to track a face, a state model representing the face
must be created. In the model, the position of each face is represented by a 4-dimensional
vector containing the co-ordinates of the left and right eyes, which in turn are derived by a
predetermined relationship to the centre position of the window and the scale being used:

$$p(k) = \begin{bmatrix} \text{FirstEyeX} \\ \text{FirstEyeY} \\ \text{SecondEyeX} \\ \text{SecondEyeY} \end{bmatrix}$$

where k is the frame number.

The current state of the face is represented by its position, velocity and acceleration,
in a 12-dimensional vector:

$$\hat{z}(k) = \begin{bmatrix} p(k) \\ \dot{p}(k) \\ \ddot{p}(k) \end{bmatrix}$$



First Face Detected 

The tracking algorithm does nothing until it receives a frame with a face detection
result indicating that there is a face present.

A Kalman filter is then initialised for each detected face in this frame. Its state is
initialised with the position of the face, and with zero velocity and acceleration:

$$\hat{z}_a(k) = \begin{bmatrix} p(k) \\ 0 \\ 0 \end{bmatrix}$$

It is also assigned some other attributes: the state model error covariance, Q, and the
observation error covariance, R. The error covariance of the Kalman filter, P, is also
initialised. These parameters are described in more detail below. At the beginning of the
following frame, and every subsequent frame, a Kalman filter prediction process is carried
out.

Kalman Filter Prediction Process

For each existing Kalman filter, the next position of the face is predicted using the
standard Kalman filter prediction equations shown below. The filter uses the previous state
(at frame k-1) and some other internal and external variables to estimate the current state of
the filter (at frame k).

State prediction equation:

$$\hat{z}_b(k) = \Phi(k,k-1)\,\hat{z}_a(k-1)$$

Covariance prediction equation:

$$P_b(k) = \Phi(k,k-1)\,P_a(k-1)\,\Phi(k,k-1)^T + Q(k)$$

where $\hat{z}_b(k)$ denotes the state before updating the filter for frame k, $\hat{z}_a(k-1)$ denotes the
state after updating the filter for frame k-1 (or the initialised state if it is a new filter), and
$\Phi(k,k-1)$ is the state transition matrix. Various state transition matrices were experimented
with, as described below. Similarly, $P_b(k)$ denotes the filter's error covariance before
updating the filter for frame k and $P_a(k-1)$ denotes the filter's error covariance after
updating the filter for the previous frame (or the initialised value if it is a new filter). $P_b(k)$
can be thought of as an internal variable in the filter that models its accuracy.

$Q(k)$ is the error covariance of the state model. A high value of $Q(k)$ means that the
predicted values of the filter's state (i.e. the face's position) will be assumed to have a high
level of error. By tuning this parameter, the behaviour of the filter can be changed and
potentially improved for face detection.

State Transition Matrix 

The state transition matrix, $\Phi(k,k-1)$, determines how the prediction of the next
state is made. Using the equations for motion, the following matrix can be derived for
$\Phi(k,k-1)$:

$$\Phi(k,k-1) = \begin{bmatrix} I_4 & I_4\,\delta t & \tfrac{1}{2} I_4 (\delta t)^2 \\ O_4 & I_4 & I_4\,\delta t \\ O_4 & O_4 & I_4 \end{bmatrix}$$

where $O_4$ is a 4x4 zero matrix and $I_4$ is a 4x4 identity matrix. $\delta t$ can simply be set to 1 (i.e.
units of t are frame periods).

This state transition matrix models position, velocity and acceleration. However, it
was found that the use of acceleration tended to make the face predictions accelerate towards
the edge of the picture when no face detections were available to correct the predicted state.
Therefore, a simpler state transition matrix without using acceleration was preferred:

$$\Phi(k,k-1) = \begin{bmatrix} I_4 & I_4\,\delta t & O_4 \\ O_4 & I_4 & O_4 \\ O_4 & O_4 & O_4 \end{bmatrix}$$



The predicted eye positions of each Kalman filter, $\hat{z}_b(k)$, are compared to all face
detection results in the current frame (if there are any). If the distance between the eye
positions is below a given threshold, then the face detection can be assumed to belong to the
same face as that being modelled by the Kalman filter. The face detection result is then
treated as an observation, y(k), of the face's current state:

$$y(k) = \begin{bmatrix} p(k) \\ 0 \\ 0 \end{bmatrix}$$

where p(k) is the position of the eyes in the face detection result. This observation is used
during the Kalman filter update stage to help correct the prediction.
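The prediction and matching steps described above may be sketched as follows. This is an illustrative sketch only: the function names, the use of a Euclidean distance over the four eye co-ordinates and the choice of the nearest detection when several lie within the threshold are assumptions, not details given in the description.

```python
import numpy as np


def transition_matrix(dt=1.0, use_acceleration=False):
    """12x12 transition matrix for a state of [position, velocity, acceleration]."""
    I4, O4 = np.eye(4), np.zeros((4, 4))
    if use_acceleration:
        return np.block([[I4, dt * I4, 0.5 * dt ** 2 * I4],
                         [O4, I4, dt * I4],
                         [O4, O4, I4]])
    # Simpler form: position advances by velocity, acceleration is dropped.
    return np.block([[I4, dt * I4, O4],
                     [O4, I4, O4],
                     [O4, O4, O4]])


def predict(z_a, P_a, Phi, Q):
    """Standard Kalman prediction of the state and its error covariance."""
    z_b = Phi @ z_a
    P_b = Phi @ P_a @ Phi.T + Q
    return z_b, P_b


def match_detection(z_b, detections, max_distance):
    """Return the detection (4 eye co-ordinates) nearest the prediction, or None."""
    best, best_dist = None, max_distance
    for p in detections:
        dist = np.linalg.norm(np.asarray(p, dtype=float) - z_b[:4])
        if dist < best_dist:
            best, best_dist = p, dist
    return best


def observation(p):
    """Build the 12-dimensional observation y(k) = [p(k), 0, 0]."""
    y = np.zeros(12)
    y[:4] = p
    return y
```

When `match_detection` returns `None`, the filter has no face detection to use as an observation for that frame and the skin colour matching described next is attempted.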

Skin Colour Matching 

Skin colour matching is not used for faces that successfully match face detection
results. Skin colour matching is only performed for faces whose position has been predicted
by the Kalman filter but have no matching face detection result in the current frame, and
therefore no observation data to help update the Kalman filter.

In a first technique, for each face, an elliptical area centred on the face's previous
position is extracted from the previous frame. An example of such an area 600 within the
face window 610 is shown schematically in Figure 16. A colour model is seeded using the
chrominance data from this area to produce an estimate of the mean and covariance of the Cr
and Cb values, based on a Gaussian model.

An area around the predicted face position in the current frame is then searched and
the position that best matches the colour model, again averaged over an elliptical area, is
selected. If the colour match meets a given similarity criterion, then this position is used as
an observation, y(k), of the face's current state in the same way described for face detection
results in the previous section.

Figures 15a and 15b schematically illustrate the generation of the search area. In
particular, Figure 15a schematically illustrates the predicted position 620 of a face within the
next image 630. In skin colour matching, a search area 640 surrounding the predicted
position 620 in the next image is searched for the face.

If the colour match does not meet the similarity criterion, then no reliable observation
data is available for the current frame. Instead, the predicted state, $\hat{z}_b(k)$, is used as the
observation:

$$y(k) = \hat{z}_b(k)$$

The skin colour matching methods described above use a simple Gaussian skin
colour model. The model is seeded on an elliptical area centred on the face in the previous
frame, and used to find the best matching elliptical area in the current frame. However, to
provide a potentially better performance, two further methods were developed: a colour
histogram method and a colour mask method. These will now be described.

Colour Histogram Method

In this method, instead of using a Gaussian to model the distribution of colour in the
tracked face, a colour histogram is used.

For each tracked face in the previous frame, a histogram of Cr and Cb values within a
square window around the face is computed. To do this, for each pixel the Cr and Cb values
are first combined into a single value. A histogram is then computed that measures the
frequency of occurrence of these values in the whole window. Because the number of
combined Cr and Cb values is large (256x256 possible combinations), the values are
quantised before the histogram is calculated.

Having calculated a histogram for a tracked face in the previous frame, the histogram
is used in the current frame to try to estimate the most likely new position of the face by
finding the area of the image with the most similar colour distribution. As shown
schematically in Figures 15a and 15b, this is done by calculating a histogram in exactly the
same way for a range of window positions within a search area of the current frame. This
search area covers a given area around the predicted face position. The histograms are then
compared by calculating the mean squared error (MSE) between the original histogram for
the tracked face in the previous frame and each histogram in the current frame. The
estimated position of the face in the current frame is given by the position of the minimum
MSE.
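A minimal sketch of the histogram search is given below. The function names, the representation of a window as an (H, W, C) array of 8-bit channel values, the top-left addressing of window positions and the normalisation of each histogram to unit sum are assumptions made for illustration.

```python
import numpy as np


def colour_histogram(window, levels=8):
    """Quantised joint colour histogram of an (H, W, C) window of 8-bit values."""
    q = (window.astype(np.int64) * levels) // 256        # quantise each channel
    channels = q.shape[-1]
    weights = levels ** np.arange(channels)
    combined = (q.reshape(-1, channels) * weights).sum(axis=1)  # one value per pixel
    hist = np.bincount(combined, minlength=levels ** channels).astype(float)
    return hist / hist.sum()


def best_match(ref_hist, frame, size, positions, levels=8):
    """Return the (row, col) window position with minimum MSE, and that MSE."""
    best_pos, best_mse = None, np.inf
    for r, c in positions:
        hist = colour_histogram(frame[r:r + size, c:c + size], levels)
        mse = np.mean((hist - ref_hist) ** 2)
        if mse < best_mse:
            best_pos, best_mse = (r, c), mse
    return best_pos, best_mse
```

Here `positions` would be the set of candidate window positions inside the search area around the predicted face position.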

Various modifications may be made to this algorithm, including:

• Using three channels (Y, Cr and Cb) instead of two (Cr, Cb).

• Varying the number of quantisation levels.

• Dividing the window into blocks and calculating a histogram for each block. In this way,
the colour histogram method becomes positionally dependent. The MSE between each
pair of histograms is summed in this method.

• Varying the number of blocks into which the window is divided.

• Varying the blocks that are actually used - e.g. omitting the outer blocks which might
only partially contain face pixels.

For the test data used in empirical trials of these techniques, the best results were
achieved using the following conditions, although other sets of conditions may provide
equally good or better results with different test data:

• 3 channels (Y, Cr and Cb).

• 8 quantisation levels for each channel (i.e. histogram contains 8x8x8 = 512 bins).

• Dividing the windows into 16 blocks.

• Using all 16 blocks.

Colour Mask Method

This method is based on the method first described above. It uses a Gaussian skin
colour model to describe the distribution of pixels in the face.

In the method first described above, an elliptical area centred on the face is used to
colour match faces, as this may be perceived to reduce or minimise the quantity of
background pixels which might degrade the model.

In the present colour mask model, a similar elliptical area is still used to seed a colour
model on the original tracked face in the previous frame, for example by applying the mean
and covariance of RGB or YCrCb to set parameters of a Gaussian model (or alternatively, a
default colour model such as a Gaussian model can be used, see below). However, it is not
used when searching for the best match in the current frame. Instead, a mask area is
calculated based on the distribution of pixels in the original face window from the previous
frame. The mask is calculated by finding the 50% of pixels in the window which best match
the colour model. An example is shown in Figures 17a to 17c. In particular, Figure 17a
schematically illustrates the initial window under test; Figure 17b schematically illustrates
the elliptical window used to seed the colour model; and Figure 17c schematically illustrates
the mask defined by the 50% of pixels which most closely match the colour model.

To estimate the position of the face in the current frame, a search area around the
predicted face position is searched (as before) and the "distance" from the colour model is
calculated for each pixel. The "distance" refers to a difference from the mean, normalised in
each dimension by the variance in that dimension. An example of the resultant distance
image is shown in Figure 18. For each position in this distance map (or for a reduced set of
sampled positions to reduce computation time), the pixels of the distance image are averaged
over a mask-shaped area. The position with the lowest averaged distance is then selected as
the best estimate for the position of the face in this frame.

This method thus differs from the original method in that a mask-shaped area is used
in the distance image, instead of an elliptical area. This allows the colour match method to
use both colour and shape information.
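The mask construction and search may be sketched as follows. This is an illustration under stated assumptions: the "distance" is taken to be the sum over the colour dimensions of the squared difference from the mean divided by the variance, the mask is a Boolean array, and all function names are hypothetical.

```python
import numpy as np


def distance_image(pixels, mean, var):
    """Per-pixel distance from the colour model, normalised per dimension by variance."""
    return (((pixels - mean) ** 2) / var).sum(axis=-1)


def make_mask(window_distance, fraction=0.5):
    """Mask of the `fraction` of window pixels that best match the colour model.

    Ties at the cut-off distance are all included, so the mask may hold
    slightly more than the requested fraction.
    """
    k = max(1, int(round(fraction * window_distance.size)))
    cutoff = np.partition(window_distance.ravel(), k - 1)[k - 1]
    return window_distance <= cutoff


def best_position(distance, mask, positions):
    """Return the (row, col) position with the lowest mask-averaged distance."""
    h, w = mask.shape
    scores = [distance[r:r + h, c:c + w][mask].mean() for r, c in positions]
    return positions[int(np.argmin(scores))]
```

The mask would be computed once from the face window in the previous frame and then slid over the distance image of the current frame.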

Two variations are proposed and were implemented in empirical trials of the
techniques:

(a) Gaussian skin colour model is seeded using the mean and covariance of Cr and Cb
from an elliptical area centred on the tracked face in the previous frame.

(b) A default Gaussian skin colour model is used, both to calculate the mask in the
previous frame and calculate the distance image in the current frame.

The use of Gaussian skin colour models will now be described further. A Gaussian
model for the skin colour class is built using the chrominance components of the YCbCr
colour space. The similarity of test pixels to the skin colour class can then be measured. This
method thus provides a skin colour likelihood estimate for each pixel, independently of the
eigenface-based approaches.

Let w be the vector of the CbCr values of a test pixel. The probability of w belonging
to the skin colour class S is modelled by a two-dimensional Gaussian:

$$p(w \mid S) = \frac{\exp\left[-\tfrac{1}{2}(w-\mu_s)^T \Sigma_s^{-1} (w-\mu_s)\right]}{2\pi\,|\Sigma_s|^{1/2}}$$

where the mean $\mu_s$ and the covariance matrix $\Sigma_s$ of the distribution are (previously)
estimated from a training set of skin colour values.
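A two-dimensional Gaussian skin colour model of this kind may be sketched as follows. The function names and the use of the sample covariance as the estimator are assumptions of the sketch; the training values would in practice come from a set of known skin pixels.

```python
import numpy as np


def fit_skin_model(samples):
    """Estimate the mean and covariance from an (N, 2) array of CbCr training values."""
    mu = samples.mean(axis=0)
    sigma = np.cov(samples, rowvar=False)
    return mu, sigma


def skin_likelihood(w, mu, sigma):
    """Two-dimensional Gaussian density of the CbCr vector w under the skin model."""
    d = np.asarray(w, dtype=float) - mu
    mahalanobis = d @ np.linalg.inv(sigma) @ d
    return np.exp(-0.5 * mahalanobis) / (2.0 * np.pi * np.sqrt(np.linalg.det(sigma)))
```

The likelihood is largest at the model mean and falls away for pixels whose chrominance is far from the training data.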

Skin colour detection is not considered to be an effective face detector when used on
its own. This is because there can be many areas of an image that are similar to skin colour
but are not necessarily faces, for example other parts of the body. However, it can be used to
improve the performance of the eigenblock-based approaches by using a combined approach
as described in respect of the present face tracking system. The decisions made on whether
to accept the face detected eye positions or the colour matched eye positions as the
observation for the Kalman filter, or whether no observation was accepted, are stored. These
are used later to assess the ongoing validity of the faces modelled by each Kalman filter.

Kalman Filter Update Step 
The update step is used to determine an appropriate output of the filter for the current
frame, based on the state prediction and the observation data. It also updates the internal
variables of the filter based on the error between the predicted state and the observed state.

The following equations are used in the update step:

Kalman gain equation:

$$K(k) = P_b(k)\,H^T(k)\left[H(k)\,P_b(k)\,H^T(k) + R(k)\right]^{-1}$$

State update equation:

$$\hat{z}_a(k) = \hat{z}_b(k) + K(k)\left[y(k) - H(k)\,\hat{z}_b(k)\right]$$

Covariance update equation:

$$P_a(k) = P_b(k) - K(k)\,H(k)\,P_b(k)$$

Here, $K(k)$ denotes the Kalman gain, another variable internal to the Kalman filter. It
is used to determine how much the predicted state should be adjusted based on the observed
state, $y(k)$.

$H(k)$ is the observation matrix. It determines which parts of the state can be
observed. In our case, only the position of the face can be observed, not its velocity or
acceleration, so the following matrix is used for $H(k)$:

$$H(k) = \begin{bmatrix} I_4 & O_4 & O_4 \\ O_4 & O_4 & O_4 \\ O_4 & O_4 & O_4 \end{bmatrix}$$

$R(k)$ is the error covariance of the observation data. In a similar way to $Q(k)$, a high
value of $R(k)$ means that the observed values of the filter's state (i.e. the face detection
results or colour matches) will be assumed to have a high level of error. By tuning this
parameter, the behaviour of the filter can be changed and potentially improved for face
detection. For our experiments, a large value of $R(k)$ relative to $Q(k)$ was found to be
suitable (this means that the predicted face positions are treated as more reliable than the
observations). Note that it is permissible to vary these parameters from frame to frame.
Therefore, an interesting future area of investigation may be to adjust the relative values of
$R(k)$ and $Q(k)$ depending on whether the observation is based on a face detection result
(reliable) or a colour match (less reliable).

For each Kalman filter, the updated state, $\hat{z}_a(k)$, is used as the final decision on the
position of the face. This data is output to file and stored.
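The update step may be sketched as below, using the standard Kalman filter update. This is a sketch under assumptions: the function names are hypothetical, and R is assumed to be positive definite over all 12 dimensions so that the matrix being inverted is non-singular.

```python
import numpy as np


def observation_matrix():
    """12x12 observation matrix exposing only the four position components."""
    H = np.zeros((12, 12))
    H[:4, :4] = np.eye(4)
    return H


def update(z_b, P_b, y, H, R):
    """Standard Kalman update: returns the corrected state and error covariance."""
    S = H @ P_b @ H.T + R                 # innovation covariance
    K = P_b @ H.T @ np.linalg.inv(S)      # Kalman gain
    z_a = z_b + K @ (y - H @ z_b)
    P_a = P_b - K @ H @ P_b
    return z_a, P_a
```

With a small R the updated position follows the observation closely; with a large R relative to the prediction covariance it stays near the predicted position, which corresponds to treating predictions as more reliable than observations.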

Unmatched face detection results are treated as new faces. A new Kalman filter is
initialised for each of these. Faces are removed which:

• Leave the edge of the picture and/or

• Have a lack of ongoing evidence supporting them (when there is a high proportion of
observations based on Kalman filter predictions rather than face detection results or
colour matches).

For these faces, the associated Kalman filter is removed and no data is output to file.
As an optional difference from this approach, where a face is detected to leave the picture,
the tracking results up to the frame before it leaves the picture may be stored and treated as
valid face tracking results (providing that the results meet any other criteria applied to
validate tracking results).

These rules may be formalised and built upon by bringing in some additional
variables:

prediction_acceptance_ratio_threshold
    If, during tracking a given face, the proportion of accepted Kalman predicted face
    positions exceeds this threshold, then the tracked face is rejected.
    This is currently set to 0.8.

detection_acceptance_ratio_threshold
    During a final pass through all the frames, if for a given face the proportion of
    accepted face detections falls below this threshold, then the tracked face is rejected.
    This is currently set to 0.08.

min_frames
    During a final pass through all the frames, if for a given face the number of
    occurrences is less than min_frames, the face is rejected. This is only likely to occur
    near the end of a sequence. min_frames is currently set to 5.

final_prediction_acceptance_ratio_threshold and min_frames2
    During a final pass through all the frames, if for a given tracked face the number of
    occurrences is less than min_frames2 AND the proportion of accepted Kalman
    predicted face positions exceeds the final_prediction_acceptance_ratio_threshold,
    the face is rejected. Again, this is only likely to occur near the end of a sequence.
    final_prediction_acceptance_ratio_threshold is currently set to 0.5 and min_frames2
    is currently set to 10.

min_eye_spacing
    Additionally, faces are now removed if they are tracked such that the eye spacing is
    decreased below a given minimum distance. This can happen if the Kalman filter
    falsely believes the eye distance is becoming smaller and there is no other evidence,
    e.g. face detection results, to correct this assumption. If uncorrected, the eye distance
    would eventually become zero. As an optional alternative, a minimum or lower limit
    eye separation can be forced, so that if the detected eye separation reduces to the
    minimum eye separation, the detection process continues to search for faces having
    that eye separation, but not a smaller eye separation.
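The ratio and frame-count rules above may be sketched as a single check. This is a simplified illustration: it assumes that each frame of a track carries a label "D" (accepted face detection), "S" (accepted skin colour match) or "P" (accepted Kalman prediction), and it applies all four rules together in one pass, whereas the description applies the first rule during tracking. The class and function names are hypothetical; the default values are those quoted above.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    prediction_acceptance_ratio_threshold: float = 0.8
    detection_acceptance_ratio_threshold: float = 0.08
    min_frames: int = 5
    final_prediction_acceptance_ratio_threshold: float = 0.5
    min_frames2: int = 10


def reject_track(types, thresholds=None):
    """Return True if a track with the given per-frame labels should be rejected."""
    t = thresholds or Thresholds()
    n = len(types)
    if n == 0:
        return True
    prediction_ratio = types.count("P") / n
    detection_ratio = types.count("D") / n
    if prediction_ratio > t.prediction_acceptance_ratio_threshold:
        return True
    if detection_ratio < t.detection_acceptance_ratio_threshold:
        return True
    if n < t.min_frames:
        return True
    if n < t.min_frames2 and prediction_ratio > t.final_prediction_acceptance_ratio_threshold:
        return True
    return False
```

For example, a seven-frame track with four predicted positions passes the first three rules but is rejected by the last, because it is shorter than min_frames2 and more than half of its positions are predictions.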

It is noted that the tracking process is not limited to tracking through a video
sequence in a forward temporal direction. Assuming that the image data remain accessible
(i.e. the process is not real-time, or the image data are buffered for temporary continued use),
the entire tracking process could be carried out in a reverse temporal direction. Or, when a
first face detection is made (often part-way through a video sequence) the tracking process
could be initiated in both temporal directions. As a further option, the tracking process could
be run in both temporal directions through a video sequence, with the results being combined
so that (for example) a tracked face meeting the acceptance criteria is included as a valid
result whichever direction the tracking took place.

Overlap Rules for Face Tracking

When the faces are tracked, it is possible for the face tracks to become overlapped.
When this happens, in at least some applications, one of the tracks should be deleted. A set
of rules is used to determine which face track should persist in the event of an overlap.

Whilst the faces are being tracked there are 3 possible types of track: 

D: Face Detection - the current position of the face is confirmed by a new face 
detection 

S: Skin colour track - there is no face detection, but a suitable skin colour track
has been found

P: Prediction - there is neither a suitable face detection nor skin colour track, so
the predicted face position from the Kalman filter is used.

The following grid defines a priority order if two face tracks overlap with each other:

                        Face 2
Face 1      D                    S                    P
D           Largest Face Size    D                    D
S           D                    Largest Face Size    S
P           D                    S                    Largest Face Size



So, if both tracks are of the same type, then the largest face size determines which
track is to be maintained. Otherwise, detected tracks have priority over skin colour or
predicted tracks. Skin colour tracks have priority over predicted tracks.
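The priority grid may be expressed compactly as below. In this sketch each track is represented as a (type, face size) pair; that representation, the function name and the choice of the first track when both type and size are equal are assumptions made for illustration.

```python
PRIORITY = {"D": 2, "S": 1, "P": 0}  # detection > skin colour > prediction


def surviving_track(track1, track2):
    """Return whichever of two overlapping (type, face_size) tracks should persist."""
    (type1, size1), (type2, size2) = track1, track2
    if PRIORITY[type1] != PRIORITY[type2]:
        return track1 if PRIORITY[type1] > PRIORITY[type2] else track2
    # Same type: the largest face size wins.
    return track1 if size1 >= size2 else track2
```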

In the tracking method described above, a face track is started for every face
detection that cannot be matched up with an existing track. This could lead to many false
detections being erroneously tracked and persisting for several frames before finally being
rejected by one of the existing rules (e.g. by the rule associated with the
prediction_acceptance_ratio_threshold).

Also, the existing rules for rejecting a track (e.g. those rules relating to the variables
prediction_acceptance_ratio_threshold and detection_acceptance_ratio_threshold), are
biased against tracking someone who turns their head to the side for a significant length of
time. In reality, it is often desirable to carry on tracking someone who does this.
A solution will now be described.

The first part of the solution helps to prevent false detections from setting off
erroneous tracks. A face track is still started internally for every face detection that does not
match an existing track. However, it is not output from the algorithm. In order for this track
to be maintained, the first f frames in the track must be face detections (i.e. of type D). If all
of the first f frames are of type D then the track is maintained and face locations are output
from the algorithm from frame f onwards.
If the first f frames are not all of type D, then the face track is terminated and no
face locations are output for this track.
f is typically set to 2, 3 or 5.

The second part of the solution allows faces in profile to be tracked for a long period,
rather than having their tracks terminated due to a low detection_acceptance_ratio. To
achieve this, where the faces are matched by the ±30° eigenblocks, the tests relating to the
variables prediction_acceptance_ratio_threshold and detection_acceptance_ratio_threshold
are not used. Instead, an option is to include the following criterion to maintain a face track:
g consecutive face detections are required every n frames to maintain the face track
where g is typically set to a similar value to f, e.g. 1-5 frames, and n corresponds to
the maximum number of frames for which we wish to be able to track someone when they
are turned away from the camera, e.g. 10 seconds (= 250 or 300 frames depending on frame
rate).
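The two parts of the solution may be sketched as follows. The per-frame labels "D", "S" and "P" are the track types defined above. The reading of "g consecutive face detections every n frames" as "the most recent n frames must contain a run of at least g detections" is one possible interpretation, and the function names are hypothetical.

```python
def track_confirmed(types, f=3):
    """True once the first f frames of a new track are all face detections (type D)."""
    return len(types) >= f and all(t == "D" for t in types[:f])


def profile_track_alive(types, g=3, n=250):
    """True if the most recent n frames contain g consecutive face detections.

    A track shorter than n frames cannot yet fail this criterion.
    """
    if len(types) < n:
        return True
    run = 0
    for t in types[-n:]:
        run = run + 1 if t == "D" else 0
        if run >= g:
            return True
    return False
```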

This may also be combined with the prediction_acceptance_ratio_threshold and
detection_acceptance_ratio_threshold rules. Alternatively, the
prediction_acceptance_ratio_threshold and detection_acceptance_ratio_threshold may be
applied on a rolling basis e.g. over only the last 30 frames, rather than since the beginning of
the track.

Another criterion for rejecting a face track is that a so-called "bad colour threshold"
is exceeded. In this test a tracked face position is validated by skin colour (whatever the
acceptance type - face detection or Kalman prediction). Any face whose distance from an
expected skin colour exceeds a given "bad colour threshold" has its track terminated.

In the method described above, the skin colour of the face is only checked during
skin colour tracking. This means that non-skin-coloured false detections may be tracked, or
the face track may wander off into non-skin-coloured locations by using the predicted face
position.

To improve on this, whatever the acceptance type of the face (detection, skin colour
or Kalman prediction), its skin colour is checked. If its distance (difference) from skin colour
exceeds a bad_colour_threshold, then the face track is terminated.

An efficient way to implement this is to use the distance from skin colour of each
pixel calculated during skin colour tracking. If this measure, averaged over the face area
(either over a mask shaped area, over an elliptical area or over the whole face window
depending on which skin colour tracking method is being used), exceeds a fixed threshold,
then the face track is terminated.

A further criterion for rejecting a face track is that its variance is very low or very
high. This technique will be described below after the description of Figures 22a to 22c.

In the tracking system shown schematically in Figure 14, three further features are
included.

Shot boundary data 560 (from metadata associated with the image sequence under
test; or metadata generated within the camera of Figure 2) defines the limits of each
contiguous "shot" within the image sequence. The Kalman filter is reset at shot boundaries,
and is not allowed to carry a prediction over to a subsequent shot, as the prediction would be
meaningless.

User metadata 542 and camera setting metadata 544 are supplied as inputs to the face
detector 540. These may also be used in a non-tracking system. Examples of the camera
setting metadata were described above. User metadata may include information such as:

• type of programme (e.g. news, interview, drama)

• script information such as specification of a "long shot", "medium close-up" etc
(particular types of camera shot leading to an expected sub-range of face sizes), how
many people involved in each shot (again leading to an expected sub-range of face sizes)
and so on

• sports-related information - sports are often filmed from fixed camera positions using
standard views and shots. By specifying these in the metadata, again a sub-range of face
sizes can be derived

The type of programme is relevant to the type of face which may be expected in the
images or image sequence. For example, in a news programme, one would expect to see a
single face for much of the image sequence, occupying an area of (say) 10% of the screen.
The detection of faces at different scales can be weighted in response to this data, so that
faces of about this size are given an enhanced probability. Another alternative or additional
approach is that the search range is reduced, so that instead of searching for faces at all
possible scales, only a subset of scales is searched. This can reduce the processing
requirements of the face detection process. In a software-based system, the software can run
more quickly and/or on a less powerful processor. In a hardware-based system (including
for example an application-specific integrated circuit (ASIC) or field programmable gate
array (FPGA) system) the hardware needs may be reduced.

The other types of user metadata mentioned above may also be applied in this way.
The "expected face size" sub-ranges may be stored in a look-up table held in the memory 30,
for example.

As regards camera metadata, for example the current focus and zoom settings of the
lens 110, these can also assist the face detector by giving an initial indication of the expected
image size of any faces that may be present in the foreground of the image. In this regard, it
is noted that the focus and zoom settings between them define the expected separation
between the camcorder 100 and a person being filmed, and also the magnification of the lens
110. From these two attributes, based upon an average face size, it is possible to calculate
the expected size (in pixels) of a face in the resulting image data, leading again to a sub-
range of sizes for search or a weighting of the expected face sizes.
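The calculation can be sketched with a simple pinhole-lens approximation, in which an object of height H at distance d forms an image of height f*H/d on the sensor for a focal length f (valid when d is much larger than f). The approximation itself, the assumed average face height of 0.24 m, the pixel pitch parameter and the tolerance used to form the sub-range are all assumptions for illustration and are not taken from the description.

```python
def expected_face_size_px(focal_length_mm, subject_distance_m,
                          pixel_pitch_um, face_height_m=0.24):
    """Estimate the height in pixels of a face in the captured image.

    focal_length_mm    -- lens focal length implied by the zoom setting
    subject_distance_m -- subject distance implied by the focus setting
    pixel_pitch_um     -- assumed sensor pixel pitch in micrometres
    face_height_m      -- assumed average face height (hypothetical value)
    """
    # Pinhole approximation: image height on the sensor = f * H / d.
    image_height_mm = focal_length_mm * (face_height_m * 1000.0) / (subject_distance_m * 1000.0)
    # Convert millimetres on the sensor to a pixel count.
    return image_height_mm * 1000.0 / pixel_pitch_um


def size_sub_range(expected_px, tolerance=0.25):
    """Return a (min, max) sub-range of face sizes around the expected size."""
    return (expected_px * (1.0 - tolerance), expected_px * (1.0 + tolerance))
```

For example, a 50 mm focal length, a subject 3 m away and a 10 micrometre pixel pitch give an image height of 4 mm on the sensor, that is about 400 pixels, so the search could be limited to sizes near that value.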

This arrangement lends itself to use in a video conferencing or so-called digital
signage environment.

In a video conferencing arrangement the user could classify the video material as
"individual speaker", "Group of two", "Group of three" etc, and based on this classification a
face detector could derive an expected face size and could search for and highlight the one or
more faces in the image.

In a digital signage environment, advertising material could be displayed on a video
screen. Face detection is used to detect the faces of people looking at the advertising
material.

Advantages of the tracking algorithm

The face tracking technique has three main benefits: 
• It allows missed faces to be filled in by using Kalman filtering
in frames for which no face detection results are available. This increases the true
acceptance rate across the image sequence.

• It provides face linking: by successfully tracking a face, the algorithm automatically
knows whether a face detected in a future frame belongs to the same person or a different
person. Thus, scene metadata can easily be generated from this algorithm, comprising the
number of faces in the scene, the frames for which they are present and providing a
representative mugshot of each face.

• False face detections tend to be rejected, as such detections tend not to carry forward
between images.

Figures 19a to 19c schematically illustrate the use of face tracking when applied to a
video scene.

In particular, Figure 19a schematically illustrates a video scene 800 comprising
successive video images (e.g. fields or frames) 810.

In this example, the images 810 contain one or more faces. In particular all of the
images 810 in the scene include a face A, shown at an upper left-hand position within the
schematic representation of the image 810. Also, some of the images include a face B shown
schematically at a lower right hand position within the schematic representations of the
images 810.

A face tracking process is applied to the scene of Figure 19a. Face A is tracked
reasonably successfully throughout the scene. In one image 820 the face is not tracked by a
direct detection, but the skin colour matching techniques and the Kalman filtering techniques
described above mean that the detection can be continuous either side of the "missing" image
820. The representation of Figure 19b indicates the detected probability of a face being
present in each of the images. It can be seen that the probability is highest at an image 830,
and so the part 840 of the image detected to contain face A is used as a "picture stamp" in
respect of face A. Picture stamps will be described in more detail below.
Similarly, face B is detected with different levels of confidence, but an image 850
gives rise to the highest detected probability of face B being present. Accordingly, the part
of the corresponding image detected to contain face B (part 860) is used as a picture stamp
for face B within that scene. (Alternatively, of course, a wider section of the image, or even
the whole image, could be used as the picture stamp).
For each tracked face, a single representative face picture stamp is required.
Outputting the face picture stamp based purely on face probability does not always give the
best quality of picture stamp. To get the best picture quality it would be better to bias or
steer the selection decision towards faces that are detected at the same resolution as the
picture stamp, e.g. 64x64 pixels.

To get the best quality picture stamps the following scheme may be applied:
(1) Use a face that was detected (not colour tracked / Kalman tracked)
(2) Use a face that gave a high probability during face detection, i.e. at least a
threshold probability
(3) Use a face which is as close as possible to 64x64 pixels, to reduce rescaling
artefacts and improve picture quality
(4) Do not (if possible) use a very early face in the track, i.e. a face in a
predetermined initial portion of the tracked sequence (e.g. 10% of the tracked sequence, or
20 frames, etc) in case this means that the face is still very distant (i.e. small) and blurry

Some rules that could achieve this are as follows:
For each face detection:

Calculate the metric M = face_probability * size_weighting, where size_weighting =
MIN( (face_size/64)^x, (64/face_size)^x ) and x=0.25. Then take the face picture stamp for
which M is largest.

This gives the following weightings on the face probability for each face size: 

face_size    size_weighting
16           0.71
19           0.74
23           0.77
27           0.81
32           0.84
38           0.88
45           0.92
54           0.96
64           1.00
76           0.96
91           0.92
108          0.88
128          0.84
152          0.81
181          0.77
215          0.74
256          0.71
304          0.68
362          0.65
431          0.62
512          0.59



In practice this could be done using a look-up table.

To make the weighting function less harsh, a smaller power than 0.25, e.g. x=0.2 or
0.1, could be used.
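The metric and the selection rules can be sketched in Python as follows. The formula for size_weighting is the one given above; the FaceObservation record, its field names and the default threshold of 0.5 are assumptions introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class FaceObservation:
    """One entry of a face track (hypothetical structure for illustration)."""
    frame: int
    probability: float   # face probability reported for this frame
    size: int            # face size in pixels (e.g. 64 for a 64x64 face)
    detected: bool       # True for a direct detection, False if colour/Kalman tracked


def size_weighting(face_size, target=64, x=0.25):
    """MIN((face_size/target)^x, (target/face_size)^x), as in the rule above."""
    return min((face_size / target) ** x, (target / face_size) ** x)


def pick_picture_stamp(track, min_probability=0.5):
    """Return the observation with the largest M = probability * size_weighting.

    Only direct detections at or above the (assumed) threshold are considered,
    following rules (1) and (2). Returns None if no observation qualifies.
    """
    candidates = [o for o in track
                  if o.detected and o.probability >= min_probability]
    if not candidates:
        return None
    return max(candidates, key=lambda o: o.probability * size_weighting(o.size))
```

Evaluating size_weighting at the tabulated face sizes reproduces the listed weightings after rounding to two decimal places, for instance 0.71 at 16 pixels, 1.00 at 64 pixels and 0.59 at 512 pixels. Passing x=0.1 gives the gentler weighting mentioned above.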

This weighting technique could be applied to the whole face track or just to the first
N frames (to apply a weighting against the selection of a poorly-sized face from those N
frames). N could for example represent just the first one or two seconds (25-50 frames).

In addition, preference is given to faces that are frontally detected over those that
were detected at ±30 degrees (or any other pose).

Figure 20 schematically illustrates a display screen of a non-linear editing system.
Non-linear editing systems are well established and are generally implemented as
software programs running on general purpose computing systems such as the system of
Figure 1. These editing systems allow video, audio and other material to be edited to an
output media product in a manner which does not depend on the order in which the
individual media items (e.g. video shots) were captured.
The schematic display screen of Figure 20 includes a viewer area 900, in which video
clips may be viewed, a set of clip icons 910, to be described further below, and a "timeline"
920 including representations of edited video shots 930, each shot optionally containing a
picture stamp 940 indicative of the content of that shot.

At one level, the face picture stamps derived as described with reference to Figures
19a to 19c could be used as the picture stamps 940 of each edited shot so, within the edited
length of the shot, which may be shorter than the originally captured shot, the picture stamp
representing a face detection which resulted in the highest face probability value can be
inserted onto the time line to show a representative image from that shot. The probability
values may be compared with a threshold, possibly higher than the basic face detection
threshold, so that only face detections having a high level of confidence are used to generate
picture stamps in this way. If more than one face is detected in the edited shot, the face with
the highest probability may be displayed, or alternatively more than one face picture stamp
may be displayed on the time line.
Time lines in non-linear editing systems are usually capable of being scaled, so that
the length of line corresponding to the full width of the display screen can represent various
different time periods in the output media product. So, for example, if a particular boundary
between two adjacent shots is being edited to frame accuracy, the time line may be
"expanded" so that the width of the display screen represents a relatively short time period in
the output media product. On the other hand, for other purposes such as visualising an
overview of the output media product, the time line scale may be contracted so that a longer
time period may be viewed across the width of the display screen. So, depending on the
level of expansion or contraction of the time line scale, there may be less or more screen area
available to display each edited shot contributing to the output media product.
In an expanded time line scale, there may well be more than enough room to fit one
picture stamp (derived as shown in Figures 19a to 19c) for each edited shot making up the
output media product. However, as the time line scale is contracted, this may no longer be
possible. In such cases, the shots may be grouped together into "sequences", where each
sequence is such that it is displayed at a display screen size large enough to accommodate a
face picture stamp. From within the sequence, then, the face picture stamp having the
highest corresponding probability value is selected for display. If no face is detected within
a sequence, an arbitrary image, or no image, can be displayed on the timeline.
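One possible way of forming such sequences is sketched below in Python. The description does not say how shots are grouped, so the greedy left-to-right grouping, the representation of a shot as a (duration, best face probability) pair and all names are assumptions made for illustration.

```python
def group_into_sequences(shots, pixels_per_second, stamp_width_px):
    """Greedily group consecutive shots until each group is wide enough on
    screen to hold one picture stamp.

    shots -- list of (duration_seconds, best_face_probability or None)
    """
    sequences, current, width = [], [], 0.0
    for duration, probability in shots:
        current.append((duration, probability))
        width += duration * pixels_per_second
        if width >= stamp_width_px:
            sequences.append(current)
            current, width = [], 0.0
    if current:
        # Trailing shots too narrow for a stamp join the previous sequence.
        if sequences:
            sequences[-1].extend(current)
        else:
            sequences.append(current)
    return sequences


def best_stamp_probability(sequence):
    """Highest face probability in the sequence, or None if no face was detected."""
    probabilities = [p for _, p in sequence if p is not None]
    return max(probabilities) if probabilities else None
```

As the time line is contracted, pixels_per_second falls, more shots are merged into each sequence and fewer picture stamps are shown. A sequence for which best_stamp_probability returns None corresponds to the case where an arbitrary image, or no image, is displayed.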

Figure 20 also shows schematically two "face timelines" 925, 935. These scale with
the "main" timeline 920. Each face timeline relates to a single tracked face, and shows the
portions of the output edited sequence containing that tracked face. It is possible that the
user may observe that certain faces relate to the same person but have not been associated
with one another by the tracking algorithm. The user can "link" these faces by selecting the
relevant parts of the face timelines (using a standard Windows™ selection technique for
multiple items) and then clicking on a "link" screen button (not shown). The face timelines
would then reflect the linkage of the whole group of face detections into one longer tracked
face.

Figures 21a and 21b schematically illustrate two variants of clip icons 910' and 910".
These are displayed on the display screen of Figure 20 to allow the user to select individual
clips for inclusion in the time line and editing of their start and end positions (in and out
points). So, each clip icon represents the whole of a respective clip stored on the system.

In Figure 21a, a clip icon 910" is represented by a single face picture stamp 912 and
a text label area 914 which may include, for example, time code information defining the
position and length of that clip. In an alternative arrangement shown in Figure 21b, more
than one face picture stamp 916 may be included by using a multi-part clip icon.
Another possibility for the clip icons 910 is that they provide a "face summary" so
that all detected faces are shown as a set of clip icons 910, in the order in which they appear
(either in the source material or in the edited output sequence). Again, faces that are the
same person but which have not been associated with one another by the tracking algorithm
can be linked by the user subjectively observing that they are the same face. The user could
select the relevant face clip icons 910 (using a standard Windows™ selection technique for
multiple items) and then click on a "link" screen button (not shown). The tracking data
would then reflect the linkage of the whole group of face detections into one longer tracked
face.

A further possibility is that the clip icons 910 could provide a hyperlink so that the
user may click on one of the icons 910 which would then cause the corresponding clip to be
played in the viewer area 900.

A similar technique may be used in, for example, a surveillance or closed circuit
television (CCTV) system. Whenever a face is tracked, or whenever a face is tracked for at
least a predetermined number of frames, an icon similar to a clip icon 910 is generated in
respect of the continuous portion of video over which that face was tracked. The icon is
displayed in a similar manner to the clip icons in Figure 20. Clicking on an icon causes the
replay (in a window similar to the viewer area 900) of the portion of video over which that
particular face was tracked. It will be appreciated that multiple different faces could be
tracked in this way, and that the corresponding portions of video could overlap or even
completely coincide.

Figures 22a to 22c schematically illustrate a gradient pre-processing technique.
It has been noted that image windows showing little pixel variation can tend to be
detected as faces by a face detection arrangement based on eigenfaces or eigenblocks.
Therefore, a pre-processing step is proposed to remove areas of little pixel variation from the
face detection process. In the case of a multiple scale system (see above) the pre-processing
step can be carried out at each scale.

The basic process is that a "gradient test" is applied to each possible window position
across the whole image. A predetermined pixel position for each window position, such as
the pixel at or nearest the centre of that window position, is flagged or labelled in
dependence on the results of the test applied to that window. If the test shows that a window
has little pixel variation, that window position is not used in the face detection process.

A first step is illustrated in Figure 22a. This shows a window at an arbitrary window
position in the image. As mentioned above, the pre-processing is repeated at each possible
window position. Referring to Figure 22a, although the gradient pre-processing could be
applied to the whole window, it has been found that better results are obtained if the pre-
processing is applied to a central area 1000 of the test window 1010.

Referring to Figure 22b, a gradient-based measure is derived from the window (or
from the central area of the window as shown in Figure 22a), which is the average of the
absolute differences between all adjacent pixels 1011 in both the horizontal and vertical
directions, taken over the window. Each window centre position is labelled with this
gradient-based measure to produce a gradient "map" of the image. The resulting gradient
map is then compared with a threshold gradient value. Any window positions for which the
gradient-based measure lies below the threshold gradient value are excluded from the face
detection process in respect of that image.

Alternative gradient-based measures could be used, such as the pixel variance or the
mean absolute pixel difference from a mean pixel value.

The gradient-based measure is preferably carried out in respect of pixel luminance
values, but could of course be applied to other image components of a colour image.

Figure 22c schematically illustrates a gradient map derived from an example image.
Here a lower gradient area 1070 (shown shaded) is excluded from face detection, and only a
higher gradient area 1080 is used.

The embodiments described above have related to a
face detection system (involving training and detection phases) and possible uses for it in a
camera-recorder and an editing system. It will be appreciated that there are many other
possible uses of such techniques, for example (and not limited to) security surveillance
systems, media handling in general (such as video tape recorder controllers), video
conferencing systems and the like.

In other embodiments, window positions having high pixel differences can also be
flagged or labelled, and are also excluded from the face detection process. A "high" pixel
difference means that the measure described above with respect to Figure 22b exceeds an
upper threshold value.

So, a gradient map is produced as described above. Any positions for which the
gradient measure is lower than the (first) threshold gradient value mentioned earlier are
excluded from face detection processing, as are any positions for which the gradient measure
is higher than the upper threshold value.
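The gradient test with lower and upper thresholds can be sketched in Python as follows. The measure is the average absolute difference between horizontally and vertically adjacent pixels, as described with reference to Figure 22b; the image representation (a list of rows of luminance values), the function names, the margin parameter that defines the central area and any threshold values passed in are assumptions for illustration.

```python
def gradient_measure(area):
    """Average absolute difference between all horizontally and vertically
    adjacent pixel pairs in a rectangular area (list of rows)."""
    rows, cols = len(area), len(area[0])
    total, count = 0, 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:   # horizontal neighbour
                total += abs(area[r][c] - area[r][c + 1])
                count += 1
            if r + 1 < rows:   # vertical neighbour
                total += abs(area[r][c] - area[r + 1][c])
                count += 1
    return total / count if count else 0.0


def central_area(image, top, left, size, margin):
    """Central part of the size x size window whose top-left corner is (top, left)."""
    return [row[left + margin:left + size - margin]
            for row in image[top + margin:top + size - margin]]


def usable_windows(image, size, margin, lower, upper):
    """Map each window centre (row, col) to True if the window should be
    passed to face detection, i.e. lower <= gradient measure <= upper."""
    usable = {}
    for top in range(len(image) - size + 1):
        for left in range(len(image[0]) - size + 1):
            measure = gradient_measure(central_area(image, top, left, size, margin))
            usable[(top + size // 2, left + size // 2)] = lower <= measure <= upper
    return usable
```

In this sketch a single measure per window serves both tests, so a flat area (measure near zero) and a very busy area (measure above the upper threshold) are both excluded.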

It was mentioned above that the "lower threshold" processing is preferably applied to
a central part 1000 of the test window 1010. The same can apply to the "upper threshold"
processing. This would mean that only a single gradient measure needs to be derived in
respect of each window position. Alternatively, if the whole window is used in respect of the
lower threshold test, the whole window can similarly be used in respect of the upper
threshold test. Again, only a single gradient measure needs to be derived for each window
position. Of course, however, it is possible to use two different arrangements, so that (for
example) a central part 1000 of the test window 1010 is used to derive the gradient measure
for the lower threshold test, but the full test window is used in respect of the upper threshold
test.

A further criterion for rejecting a face track, mentioned earlier, is that its variance or
gradient measure is very low or very high.

In this technique a tracked face position is validated using variance values from an area of
interest map. Only a face-sized area of the map at the detected scale is stored per face for the
next iteration of tracking.

Despite the gradient pre-processing described above, it is still possible for a skin
colour tracked or Kalman predicted face to move into a (non-face-like) low or high variance
area of the image. So, during gradient pre-processing, the variance values (or gradient
values) for the areas around existing face tracks are stored.

When the final decision on the face's next position is made (with any acceptance
type, either face detection, skin colour or Kalman prediction) the position is validated against
the stored variance (or gradient) values in the area of interest map. If the position is found to
have very high or very low variance (or gradient), it is considered to be non-face-like and the
face track is terminated. This prevents face tracks from wandering onto low (or high)
variance background areas of the image.

Alternatively, even if gradient pre-processing is not used, the variance of the new
face position can be calculated afresh. In either case the variance measure used can either be
traditional variance or the sum of differences of neighbouring pixels (gradient) or any other
variance-type measure.

Figure 23 schematically illustrates a video conferencing system. Two video
conferencing stations 1100, 1110 are connected by a network connection 1120 such as: the
Internet, a local or wide area network, a telephone line, a high bit rate leased line, an ISDN
line etc. Each of the stations comprises, in simple terms, a camera and associated sending
apparatus 1130 and a display and associated receiving apparatus 1140. Participants in the
video conference are viewed by the camera at their respective station and their voices are
picked up by one or more microphones (not shown in Figure 23) at that station. The audio
and video information is transmitted via the network 1120 to the receiver 1140 at the other
station. Here, images captured by the camera are displayed and the participants' voices are
produced on a loudspeaker or the like.

It will be appreciated that more than two stations may be involved in the video
conference, although the discussion here will be limited to two stations for simplicity.

Figure 24 schematically illustrates one channel, being the connection of one
camera/sending apparatus to one display/receiving apparatus.

At the camera/sending apparatus, there is provided a video camera 1150, a face
detector 1160 using the techniques described above, an image processor 1170 and a data
formatter and transmitter 1180. A microphone 1190 detects the participants' voices.

Audio, video and (optionally) metadata signals are transmitted from the formatter and
transmitter 1180, via the network connection 1120 to the display/receiving apparatus 1140.
Optionally, control signals are received via the network connection 1120 from the
display/receiving apparatus 1140.

At the display/receiving apparatus, there is provided a display and display processor
1200, for example a display screen and associated electronics, user controls 1210 and an
audio output arrangement 1220 such as a digital to analogue converter (DAC), an amplifier
and a loudspeaker.

In general terms, the face detector 1160 detects (and optionally tracks) faces in the
captured images from the camera 1150. The face detections are passed as control signals to
the image processor 1170. The image processor can act in various different ways, which will
be described below, but fundamentally the image processor 1170 alters the images captured
by the camera 1150 before they are transmitted via the network 1120. A significant purpose
behind this is to make better use of the available bandwidth or bit rate which can be carried
by the network connection 1120. Here it is noted that in most commercial applications, the
cost of a network connection 1120 suitable for video conference purposes increases with an
increasing bit rate requirement. At the formatter and transmitter 1180 the images from the
image processor 1170 are combined with audio signals from the microphone 1190 (for
example, having been converted via an analogue to digital converter (ADC)) and optionally
metadata defining the nature of the processing carried out by the image processor 1170.

Various modes of operation of the video conferencing system will be described
below.

Figure 25 is a further schematic representation of the video conferencing system.
Here, the functionality of the face detector 1160, the image processor 1170, the formatter and
transmitter 1180 and the processor aspects of the display and display processor 1200 are
carried out by programmable personal computers 1230. The schematic displays shown on
the display screens (part of 1200) represent one possible mode of video conferencing using
face detection which will be described below with reference to Figure 31, namely that only
those image portions containing faces are transmitted from one location to the other, and are
then displayed in a tiled or mosaic form at the other location. As mentioned, this mode of
operation will be discussed below.

Figure 26 is a flowchart schematically illustrating a mode of operation of the system
of Figures 23 to 25. The flowcharts of Figures 26, 28, 31, 33 and 34 are divided into
operations carried out at the camera/sender end (1130) and those carried out at the
display/receiver end (1140).

So, referring to Figure 26, the camera 1150 captures images at a step 1300. At a step
1310, the face detector 1160 detects faces in the captured images. Ideally, face tracking (as
described above) is used to avoid any spurious interruptions in the face detection and to
provide that a particular person's face is treated in the same way throughout the video
conferencing session.

At a step 1320, the image processor 1170 crops the captured images in response to
the face detection information. This may be done as follows:

• first, identify the upper left-most face detected by the face detector 1160
• detect the upper left-most extreme of that face; this forms the upper left
corner of the cropped image
• repeat for the lower right-most face and the lower right-most extreme of that
face to form the lower right corner of the cropped image
• crop the image in a rectangular shape based on these two co-ordinates.
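The cropping steps above can be approximated by the following Python sketch. It uses the smallest rectangle enclosing all detected face boxes, which is a simplification of the procedure described (the text takes one corner from the upper left-most face and the other from the lower right-most face). The box format (left, top, right, bottom) with exclusive right and bottom edges, the optional margin and the function names are assumptions for illustration.

```python
def crop_rectangle(face_boxes, image_width, image_height, margin=0):
    """Return (left, top, right, bottom) of a rectangle enclosing all faces.

    face_boxes -- list of (left, top, right, bottom) boxes, one per detected face
    margin     -- optional border in pixels around the faces (assumed parameter)
    With no detections the full image is kept.
    """
    if not face_boxes:
        return (0, 0, image_width, image_height)
    left = max(0, min(b[0] for b in face_boxes) - margin)
    top = max(0, min(b[1] for b in face_boxes) - margin)
    right = min(image_width, max(b[2] for b in face_boxes) + margin)
    bottom = min(image_height, max(b[3] for b in face_boxes) + margin)
    return (left, top, right, bottom)


def crop(image, rect):
    """Crop an image held as a list of pixel rows to the given rectangle."""
    left, top, right, bottom = rect
    return [row[left:right] for row in image[top:bottom]]
```

Because the rectangle is derived only from the face detections, the receiver needs no extra metadata to display the cropped image.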

The cropped image is then transmitted by the formatter and transmitter 1180. In this
instance, there is no need to transmit additional metadata. The cropping of the image allows
either a reduction in bit rate compared to the full image or an improvement in transmission
quality while maintaining the same bit rate.

At the receiver, the cropped image is displayed at a full screen display at a step 1130.

Optionally, a user control 1210 can toggle the image processor 1170 between a mode
in which the image is cropped and a mode in which it is not cropped. This can allow the
participants at the receiver end to see either the whole room or just the face-related parts of
the image.

Another technique for cropping the image is as follows: 

• identify the leftmost and rightmost faces 

• maintaining the aspect ratio of the shot, locate the faces in the upper half of the 
picture. 

In an alternative to cropping, the camera could be zoomed so that the detected faces
are featured more significantly in the transmitted images. This could, for example, be
combined with a bit rate reduction technique on the resulting image. To achieve this, a
control of the directional (pan/tilt) and lens zoom properties of the camera is made available
to the image processor (represented by a dotted line 1155 in Figure 24).

Figures 27a and 27b are example images relating to the flowchart of Figure 26.
Figure 27a represents a full screen image as captured by the camera 1150, whereas Figure
27b represents a zoomed version of that image.

Figure 28 is a flowchart schematically illustrating another mode of operation of the 
system of Figures 23 to 25. Step 1300 is the same as that shown in Figure 26. 
At a step 1340, each face in the captured images is identified and highlighted, for
example by drawing a box around that face for display. Each face is also labelled, for
example with an arbitrary label a, b, c ... Here, face tracking is particularly useful to avoid
any subsequent confusion over the labels. The labelled image is formatted and transmitted to
the receiver where it is displayed at a step 1350. At a step 1360, the user selects a face to be
displayed, for example by typing the label relating to that face. The selection is passed as
control data back to the image processor 1170 which isolates the required face at a step 1370.
The required face is transmitted to the receiver. At a step 1380 the required face is
displayed. The user is able to select a different face by the step 1360 to replace the currently
displayed face. Again, this arrangement allows a potential saving in bandwidth, in that the
selection screen may be transmitted at a lower bit rate because it is only used for selecting a
face to be displayed. Alternatively, as before, the individual faces, once selected, can be
transmitted at an enhanced bit rate to achieve a better quality image.

Figure 29 is an example image relating to the flowchart of Figure 28. Here, three
faces have been identified, and are labelled a, b and c. By typing one of those three letters
into the user controls 1210, the user can select one of those faces for a full-screen display.
This can be achieved by a cropping of the main image or by the camera zooming onto that
face as described above. Figure 30 shows an alternative representation, in which so-called
thumbnail images of each face are displayed as a menu for selection at the receiver.

Figure 31 is a flowchart schematically illustrating a further mode of operation of the 
system of Figures 23 to 25. The steps 1300 and 1310 correspond to those of Figure 26. 

At a step 1400, the image processor 1170 and the formatter and transmitter 1180 co-
operate to transmit only thumbnail images relating to the captured faces. These are
displayed as a menu or mosaic of faces at the receiver end at a step 1410. At a step 1420,
optionally, the user can select just one face for enlarged display. This may involve keeping
the other faces displayed in a smaller format on the same screen or the other faces may be
hidden while the enlarged display is used. So a difference between this arrangement and that
of Figure 28 is that thumbnail images relating to all of the faces are transmitted to the
receiver, and the selection is made at the receiver end as to how the thumbnails are to be
displayed.

Figure 32 is an example image relating to the flowchart of Figure 31. Here, an initial
screen could show three thumbnails, 1430, but the stage illustrated by Figure 32 is that the
face belonging to participant c has been selected for enlarged display on a left hand part of
the display screen. However, the thumbnails relating to the other participants are retained so
that the user can make a sensible selection of a next face to be displayed in enlarged form.

It should be noted that, at least in a system where the main image is cropped, the
thumbnail images referred to in these examples are "live" thumbnail images, albeit taking
into account any processing delays present in the system. That is to say, the thumbnail
images vary in time, as the captured images of the participants vary. In a system using a
camera zoom, then the thumbnails could be static or a second camera could be used to
capture the wider angle scene.

Figure 33 is a flowchart schematically illustrating a further mode of operation. Here,
the steps 1300 and 1310 correspond to those of Figure 26.

At a step 1440 a thumbnail face image relating to the face detected to be nearest to an
active microphone is transmitted. Of course, this relies on having more than one microphone
and also a pre-selection or metadata defining which participant is sitting near to which
microphone. This can be set up in advance by a single menu-driven table entry by the users
at each video conferencing station. The active microphone is considered to be the
microphone having the greatest magnitude audio signal averaged over a certain time (such as
one second). A low pass filtering arrangement can be used to avoid changing the active
microphone too often, for example in response to a cough or an object being dropped, or two
participants speaking at the same time.
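A minimal Python sketch of the active microphone selection follows. It uses a first-order exponential smoothing of each microphone's signal magnitude as one simple form of low pass filtering; the class name, the smoothing factor alpha and the table linking microphones to face labels are assumptions for illustration and are not specified in the description.

```python
class ActiveMicrophoneSelector:
    """Track a smoothed signal level per microphone and report the loudest one."""

    def __init__(self, n_microphones, alpha=0.2):
        self.levels = [0.0] * n_microphones
        self.alpha = alpha  # smaller alpha = heavier smoothing, slower switching

    def update(self, magnitudes):
        """Feed one magnitude sample per microphone; return the active index."""
        for i, magnitude in enumerate(magnitudes):
            # First-order low pass filter: move the level a fraction alpha
            # of the way towards the new magnitude.
            self.levels[i] += self.alpha * (abs(magnitude) - self.levels[i])
        return max(range(len(self.levels)), key=self.levels.__getitem__)


# Hypothetical pre-selection table: microphone index -> label of the nearest face.
MICROPHONE_TO_FACE = {0: "a", 1: "b"}
```

With this smoothing, a single loud sample at another microphone (such as a cough) does not immediately displace the current speaker, whereas sustained speech at that microphone does. The returned index can be looked up in the table to choose which face thumbnail to transmit.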

At a step 1450 the transmitted face is displayed. A step 1460 represents the quasi-
continuous detection of a current active microphone.

The detection could be, for example, a detection of a single active microphone or
alternatively a simple triangulation technique could detect the speaker's position based on
multiple microphones.

Finally, Figure 34 is a flowchart schematically illustrating another mode of operation,
again in which the steps 1300 and 1310 correspond to those of Figure 26.

At a step 1470 the parts of the captured images immediately surrounding each face
are transmitted at a higher resolution and the background (other parts of the captured images)
is transmitted at a lower resolution. This can achieve a useful saving in bit rate or allow an
enhancement of the parts of the image surrounding each face. Optionally, metadata can be
transmitted defining the position of each face, or the positions may be derived at the receiver
by noting the resolution of different parts of the image.
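One simple way to picture this mode is the Python sketch below, which keeps full detail inside the face boxes and replaces every background block by its mean value. Block averaging is only a stand-in for whatever resolution reduction or coding a real system would use; the block size, the box format (left, top, right, bottom) with exclusive right and bottom edges, and the function name are assumptions for illustration.

```python
def reduce_background(image, face_boxes, block=4):
    """Return a copy of the image in which blocks that do not touch any face
    box are flattened to their mean value (a crude resolution reduction).

    image      -- list of rows of integer pixel values
    face_boxes -- list of (left, top, right, bottom) boxes to keep in full detail
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, block):
        for left in range(0, width, block):
            bottom, right = min(top + block, height), min(left + block, width)
            # Skip blocks that overlap any face box.
            if any(l < right and left < r and t < bottom and top < b
                   for (l, t, r, b) in face_boxes):
                continue
            pixels = [image[y][x] for y in range(top, bottom)
                      for x in range(left, right)]
            mean = sum(pixels) // len(pixels)
            for y in range(top, bottom):
                for x in range(left, right):
                    out[y][x] = mean
    return out
```

In the output the face regions retain per-pixel variation while the background blocks are uniform, which is the kind of difference a receiver could use to infer the face positions if no metadata is sent.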

At a step 1480, at the receiver end the image is displayed and the faces are optionally
labelled for selection by a user. At a step 1490 this selection could cause the selected face to
be displayed in a larger format similar to the arrangement of Figure 32.

Although the description of Figures 23 to 34 has related to video conferencing
systems, the same techniques could be applied to, for example, security monitoring (CCTV)
systems. Here, a return channel is not normally required, but an arrangement as shown in
Figure 24, where the camera / sender arrangement is provided as a CCTV camera, and the
receiver / display arrangement is provided at a monitoring site, could use the same
techniques as those described for video conferencing.

It will be appreciated that the embodiments of the invention described above may of
course be implemented, at least in part, using software-controlled data processing apparatus.
For example, one or more of the components schematically illustrated or described above may
be implemented as a software-controlled general purpose data processing device or a bespoke
program controlled data processing device such as an application specific integrated circuit, a
field programmable gate array or the like. It will be appreciated that a computer program
providing such software or program control and a storage, transmission or other providing
medium by which such a computer program is stored are envisaged as aspects of the present
invention.

The list of references and appendices follow. For the avoidance of doubt, it is noted
that the list and the appendices form a part of the present description. These documents are
all incorporated by reference.
Appendix A: Training Face Sets

One database consists of many thousand images of subjects standing in front of an indoor
background. Another training database used in experimental implementations of the above
techniques consists of more than ten thousand eight-bit greyscale images of human heads
with views ranging from frontal to left and right profiles. The skilled man will of course
understand that various different training sets could be used, optionally being profiled to
reflect facial characteristics of a local population.
Appendix B: Eigenblocks

In the eigenface approach to face detection and recognition (References 4 and 5),
each m-by-n face image is reordered so that it is represented by a vector of length mn. Each
image can then be thought of as a point in mn-dimensional space. A set of images maps to a
collection of points in this large space.

Face images, being similar in overall configuration, are not randomly distributed in
this mn-dimensional image space and therefore they can be described by a relatively low
dimensional subspace. Using principal component analysis (PCA), the vectors that best
account for the distribution of face images within the entire image space can be found. PCA
involves determining the principal eigenvectors of the covariance matrix corresponding to
the original face images. These vectors define the subspace of face images, often referred to
as the face space. Each vector represents an m-by-n image and is a linear combination of the
original face images. Because the vectors are the eigenvectors of the covariance matrix
corresponding to the original face images, and because they are face-like in appearance, they
are often referred to as eigenfaces [4].

When an unknown image is presented, it is projected into the face space. In this way,
it is expressed in terms of a weighted sum of eigenfaces.

In the present embodiments, a closely related approach is used, to generate and apply
so-called "eigenblocks" or eigenvectors relating to blocks of the face image. A grid of
blocks is applied to the face image (in the training set) or the test window (during the
detection phase) and an eigenvector-based process, very similar to the eigenface process, is
applied at each block position. (Or in an alternative embodiment to save on data processing,
the process is applied once to the group of block positions, producing one set of eigenblocks
for use at any block position). The skilled man will understand that some blocks, such as a
central block often representing a nose feature of the image, may be more significant in
deciding whether a face is present.

Calculating Eigenblocks

The calculation of eigenblocks involves the following steps: 
(1). A training set of $N_T$ images is used. These are divided into image blocks each of
size $m \times n$. So, for each block position a set of image blocks, one from that position in
each image, is obtained: $\{I_o^t\}_{t=1}^{N_T}$.

(2). A normalised training set of blocks $\{I^t\}_{t=1}^{N_T}$ is calculated as follows:
Each image block, $I_o^t$, from the original training set is normalised to have a mean of
zero and an L2-norm of 1, to produce a respective normalised image block, $I^t$.
For each image block, $I_o^t$, $t = 1, \ldots, N_T$:

$$I^t = \frac{I_o^t - \mathrm{mean\_}I_o^t}{\left\| I_o^t - \mathrm{mean\_}I_o^t \right\|}$$

where $\mathrm{mean\_}I_o^t = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} I_o^t[i,j]$

and $\left\| I_o^t - \mathrm{mean\_}I_o^t \right\| = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} \left( I_o^t[i,j] - \mathrm{mean\_}I_o^t \right)^2}$

(i.e. the L2-norm of $(I_o^t - \mathrm{mean\_}I_o^t)$)

(3). A training set of vectors $\{x^t\}_{t=1}^{N_T}$ is formed by lexicographic reordering of the
pixel elements of each image block, $I^t$. i.e. Each m-by-n image block, $I^t$, is reordered into a
vector, $x^t$, of length $N = mn$.

(4). The set of deviation vectors, $D = \{x^t\}_{t=1}^{N_T}$, is calculated. $D$ has $N$ rows and $N_T$
columns.

(5). The covariance matrix, $\Sigma$, is calculated:

$$\Sigma = D D^T$$

$\Sigma$ is a symmetric matrix of size $N \times N$.

(7). The whole set of eigenvectors, $P$, and eigenvalues, $\lambda_i$, $i = 1, \ldots, N$, of the covariance
matrix, $\Sigma$, are given by solving:

$$\Lambda = P^T \Sigma P$$

Here, $\Lambda$ is an $N \times N$ diagonal matrix with the eigenvalues, $\lambda_i$, along its diagonal (in
order of magnitude) and $P$ is an $N \times N$ matrix containing the set of $N$ eigenvectors, each of
length $N$. This decomposition is also known as a Karhunen-Loeve Transform (KLT).
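
The calculation steps above can be sketched with NumPy for a single block position. This is an illustrative sketch, not the implementation used in the embodiments: the training blocks are random stand-ins for face data, the covariance matrix is taken as the product of the deviation matrix with its transpose (an assumption, since no scaling factor is stated), and the function names `normalise_block` and `compute_eigenblocks` are hypothetical.

```python
import numpy as np


def normalise_block(block):
    """Step (2): give the block a mean of zero and an L2-norm of 1."""
    centred = block - block.mean()
    norm = np.linalg.norm(centred)
    if norm == 0:
        raise ValueError("a flat block cannot be normalised")
    return centred / norm


def compute_eigenblocks(blocks):
    """Compute eigenvalues and eigenvectors for one block position.

    blocks: array of shape (N_T, m, n).
    Returns (eigenvalues, P), eigenvalues in descending order, with the
    matching eigenvectors as the columns of P.
    """
    # Steps (2) to (4): normalise, reorder row by row, stack as columns of D.
    D = np.stack([normalise_block(b).ravel() for b in blocks], axis=1)  # N x N_T
    # Step (5): covariance matrix of size N x N, assumed here to be D D^T.
    sigma = D @ D.T
    # Step (7): eigh handles symmetric matrices and returns ascending order.
    eigenvalues, P = np.linalg.eigh(sigma)
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], P[:, order]


rng = np.random.default_rng(0)
training_blocks = rng.random((50, 4, 4))   # synthetic stand-in, N_T = 50, N = 16
eigenvalues, P = compute_eigenblocks(training_blocks)
```

Because every column of D has unit norm, the eigenvalues sum to the number of training blocks, and because every column has zero mean, at least one eigenvalue is numerically zero. Both properties offer quick sanity checks on an implementation.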
The eigenvectors can be thought of as a set of features that together characterise the
variation between the blocks of the face images. They form an orthogonal basis by which
any image block can be represented, i.e. in principle any image can be represented without
error by a weighted sum of the eigenvectors.
If the number of data points in the image space (the number of training images) is
less than the dimension of the space ($N_T < N$), there will only be $N_T$ meaningful
eigenvectors. The remaining eigenvectors will have associated eigenvalues of zero. Hence,
because typically $N_T < N$, all eigenvalues for which $i > N_T$ will be zero.

Additionally, because the image blocks in the training set are similar in overall
configuration (they are all derived from faces), only some of the remaining eigenvectors will
characterise very strong differences between the image blocks. These are the eigenvectors
with the largest associated eigenvalues. The other remaining eigenvectors with smaller
associated eigenvalues do not characterise such large differences and therefore they are not
as useful for detecting or distinguishing between faces.
Therefore, in PCA, only the $M$ principal eigenvectors with the largest magnitude
eigenvalues are considered, where $M < N_T$, i.e. a partial KLT is performed. In short, PCA
extracts a lower-dimensional subspace of the KLT basis corresponding to the largest
magnitude eigenvalues.
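
The rank argument and the truncation to M principal eigenvectors can be checked numerically. The sketch below uses random data with fewer training blocks than pixels per block; the sizes, the choice M = 3 and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N_T, m, n = 6, 4, 4                      # fewer training blocks than pixels (N = 16)
blocks = rng.random((N_T, m, n))

# Build zero-mean, unit-norm column vectors, one per training block.
columns = []
for block in blocks:
    v = (block - block.mean()).ravel()
    columns.append(v / np.linalg.norm(v))
D = np.stack(columns, axis=1)            # N x N_T

# Full decomposition, sorted so the largest eigenvalues come first.
eigenvalues, P = np.linalg.eigh(D @ D.T)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, P = eigenvalues[order], P[:, order]

# No more than N_T eigenvalues can be meaningfully non-zero.
nonzero = int(np.sum(eigenvalues > 1e-10))

# Partial KLT: keep only the M principal eigenvectors (hypothetical M < N_T).
M = 3
eigenblocks = P[:, :M]                   # N x M
retained = eigenvalues[:M].sum() / eigenvalues.sum()
```

Here `retained` is the fraction of the total eigenvalue sum captured by the M eigenblocks, one possible guide when choosing M.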

Because the principal components describe the strongest variations between the face
images, in appearance they may resemble parts of face blocks and are referred to here as
eigenblocks. However, the term eigenvectors could equally be used.

Face Detection using Eigenblocks

The similarity of an unknown image to a face, or its faceness, can be measured by
determining how well the image is represented by the face space. This process is carried out
on a block-by-block basis, using the same grid of blocks as that used in the training process.

The first stage of this process involves projecting the image into the face space. 

Projection of an Image into Face Space 
Before projecting an image into face space, much the same pre-processing steps are
performed on the image as were performed on the training set:

(1). A test image block of size $m \times n$ is obtained: $I_o$.

(2). The original test image block, $I_o$, is normalised to have a mean of zero and an
L2-norm of 1, to produce the normalised test image block, $I$:

$$I = \frac{I_o - \mathrm{mean\_}I_o}{\left\| I_o - \mathrm{mean\_}I_o \right\|}$$

where $\mathrm{mean\_}I_o = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} I_o[i,j]$

and $\left\| I_o - \mathrm{mean\_}I_o \right\| = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} \left( I_o[i,j] - \mathrm{mean\_}I_o \right)^2}$

(i.e. the L2-norm of $(I_o - \mathrm{mean\_}I_o)$)

(3). The deviation vectors are calculated by lexicographic reordering of the pixel
elements of the image. The image is reordered into a deviation vector, $x$, of length $N = mn$.

After these pre-processing steps, the deviation vector, $x$, is projected into face space
using the following simple step:

(4). The projection into face space involves transforming the deviation vector, $x$, into its
eigenblock components. This involves a simple multiplication by the $M$ principal
eigenvectors (the eigenblocks), $P_i$, $i = 1, \ldots, M$. Each weight $y_i$ is obtained as follows:

$$y_i = P_i^T x$$

where $P_i$ is the $i$-th eigenvector.

The weights $y_i$, $i = 1, \ldots, M$, describe the contribution of each eigenblock in
representing the input face block.

Blocks of similar appearance will have similar sets of weights while blocks of
different appearance will have different sets of weights. Therefore, the weights are used here
as feature vectors for classifying face blocks during face detection.
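
The pre-processing and projection steps can be sketched in plain Python for a toy 2-by-2 block. The two eigenblocks below are hand-picked orthonormal vectors chosen purely for illustration, not the output of any training run, and the function names are hypothetical.

```python
import math


def normalise_and_flatten(block):
    """Steps (1) to (3): zero mean, unit L2-norm, row-by-row reordering."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    centred = [p - mean for p in pixels]
    norm = math.sqrt(sum(c * c for c in centred))
    if norm == 0:
        raise ValueError("a flat block has no defined normalisation")
    return [c / norm for c in centred]


def project(x, eigenblocks):
    """Step (4): one weight per eigenblock, each a dot product with x."""
    return [sum(p * v for p, v in zip(P_i, x)) for P_i in eigenblocks]


# Hypothetical orthonormal eigenblocks for a 2x2 block (N = 4, M = 2).
eigenblocks = [
    [0.5, 0.5, -0.5, -0.5],   # top-versus-bottom contrast
    [0.5, -0.5, 0.5, -0.5],   # left-versus-right contrast
]

test_block = [[10, 10], [2, 2]]           # bright top row, dark bottom row
weights = project(normalise_and_flatten(test_block), eigenblocks)

# A uniformly brighter copy of the same pattern gives the same weights,
# because the mean is removed before projection.
brighter = [[110, 110], [102, 102]]
weights_brighter = project(normalise_and_flatten(brighter), eigenblocks)
```

For this block the first weight is close to 1 and the second close to 0, reflecting that the block matches the top-versus-bottom pattern and contains no left-versus-right contrast.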



